Desync for Australian/New Zealand users on US headless sessions with high player count #3039

Open
Stellanora64 opened this issue Sep 19, 2021 · 27 comments
Labels: Bug (Something isn't working), Network (Network issues or problems), Waiting on fix confirmation

Comments

@Stellanora64

Stellanora64 commented Sep 19, 2021

Describe the bug?

Desync begins to occur primarily for Australian and New Zealand users on US headless servers with a higher player count (around 8 to 10 or more).

Note:

  • Desync reduces as the number of users in the session drops (it is playable at around 4 to 5 users or fewer).
  • With the update to the LNL networking protocol this cap has increased to 6 - 7 if no one is building or spawning objects into the world. If these events do occur, queued packets accumulate before gradually decreasing (at around 200 packets per second).
  • After testing, latency above 100 - 150ms (as reported by Neos; this is one-way only, as reported in "Ping Values Incorrect - Half of What They Should Be #3544", so the round trip is more accurately 200 - 300ms) makes users extremely likely to desync (at a rate of ~200 packets per second) under the conditions listed above.

Desync begins to affect primarily:

  • Inspector panels taking 100 to 200 seconds to show any slots. (This is because inspectors are generated by the host, so desynced users need to wait for the host to generate the UI.)

  • Note: With the addition of the queued packets node, this time is directly proportional to the number of queued packets a given user is experiencing. (Roughly every 1000 QP is equivalent to 10 seconds of delay. However, this is just an estimate, as the ratio varies with ping and other factors like bandwidth usage.)

  • Objects other users have grabbed and moved only catching up after around 100 seconds (the same time it takes for inspectors to catch up).

  • Other users' voice modes. For example, if someone other than the desynced user changes their voice mode to mute, they will not be muted for the desynced user until the desync has caught up, at which point they will be muted mid-sentence.
    (Note that there is no desync in player-to-player communication: talking to another user is not affected by desync, nor are their avatar's movements.)

  • Objects other users have spawned only appearing for the desynced user once desync catches up.

  • Any changes other users have made, like changing a material, mesh, etc.

  • When a user changes their avatar, they appear frozen until the desync catches up. This also stops all player-to-player interaction that isn't normally affected by desync, like talking and avatar movement/player position, but only for the user that has changed their avatar.
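
As a rough illustration of the queued-packet estimate above, the sketch below converts a queued-packet count into an approximate delay. The function name and the default drain rate (100 packets per second, derived from the "1000 QP is about 10 seconds" figure in this report) are assumptions for illustration only, not values read from Neos; the real ratio varies with ping and bandwidth usage.

```python
def estimated_delay_seconds(queued_packets: int, drain_rate_pps: float = 100.0) -> float:
    """Approximate how long a backlog of queued packets takes to clear.

    drain_rate_pps is an assumed constant drain rate in packets per second.
    The default of 100 comes from the rough "1000 QP ~ 10 s" estimate above.
    """
    if queued_packets < 0:
        raise ValueError("queued_packets must not be negative")
    if drain_rate_pps <= 0:
        raise ValueError("drain_rate_pps must be positive")
    return queued_packets / drain_rate_pps


if __name__ == "__main__":
    for qp in (500, 1000, 5000):
        print(f"{qp} queued packets -> about {estimated_delay_seconds(qp):.0f} s of delay")
```

Passing `drain_rate_pps=200` instead models the "around 200 packets per second" catch-up rate mentioned in the notes, which halves every estimate.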

Relevant issues

No apparent relevant issues that haven't already been fixed.

To Reproduce

Reproducing this issue may be hard for users other than Australian and New Zealand users, but it is a consistent issue in creator jams where the player count is around 10 or more and the headless is in the US.

The main things needed to reproduce the bug are to be connected to a headless that is very far from where you are playing (a VPN may help with this) and for the session to have 10 or more users.

Expected behavior

To have no desync on headless servers that are in a different region from the users that join, and no desync in sessions with a high player count.

Log Files

DESKTOP-F65B6FQ - 2021.9.16.105 - 2021-09-19 10_57_51.log

The issue occurs at 1:28:57 PM.179

Screenshots

No response

How often does it happen?

Always

Does the bug persist after restarting Neos?

Yes

Neos Version Number

2021.9.16.105

What Platforms does this occur on?

Windows, Linux

Link to Reproduction Item/World

No response

Did this work before?

I Don't Know

If it worked before, on which build?

No response

Additional context

No response

Reporters

Neos: Crusher
Discord: Crusher#6146

@Stellanora64 Stellanora64 added the Bug Something isn't working label Sep 19, 2021
@shadowpanther
Contributor

As far as I know, Australia has a very low-bandwidth uplink to the rest of the world, so even if your local connection has high bandwidth, connection to any host outside of Australia would be rate-limited. Desync happens because that rate-limit is saturated with streams from all users coming through the host user. Datamodel changes have lower priority and thus changes get delayed.

I'm not sure what the solution to this would be apart from adding more uplink cables from Australia to the rest of the world to add bandwidth. Maybe mesh networking would help a bit as you would have multiple connections to different hosts instead of one.

@shadowpanther shadowpanther added the Network Network issues or problems label Sep 19, 2021
@Stellanora64
Author

Stellanora64 commented Sep 19, 2021

That does make sense, although the only thing that I would believe has a higher priority is the voice mode of other users but this doesn't seem to be the case.

@ohzee00

ohzee00 commented Sep 19, 2021

Sadly I'm unsure how much can be done here (besides perhaps mesh networking, linked above); the only suggestion I have is asking users to change their settings to Steam Networking Sockets in Neos.
Note, though, that this only works if the host has Steam, since it runs through their service; headlesses might need to be configured with that in mind.

Mind you, this isn't a direct fix. I just remember helping some users who were desyncing badly due to latency, and Steam Networking Sockets helped them somewhat, or at least allowed them to play in a semi-usable state.

@Frooxius
Collaborator

Frooxius commented Sep 19, 2021

Unfortunately this is a limitation of the current transport protocol that we are using; it degrades with certain connections (typically high latency, but it can also just be a result of quirks of a given connection).

You can try switching to "Prefer Steam Networking Sockets" in the settings, which can behave a lot better in these scenarios, but unfortunately currently doesn't have bandwidth estimation, so it can end up dropping the connection as well.

We are currently waiting on Valve to implement this feature so we can switch to it as the main protocol (or specifically the open version Game Networking Sockets).

https://github.com/ValveSoftware/GameNetworkingSockets

@Enverex

Enverex commented Sep 20, 2021

Would this also be why people on very low speed connections (e.g. ~3Mb/s down, 0.5Mb up ) are unable to ever reach sync, even in a world with one other person? I have a friend in France and regardless of whether he is host or client, even with a single other person present with almost nothing else happening, he will not be able to maintain sync.

@shadowpanther
Contributor

doesn't have bandwidth estimation

Speaking of which, if bandwidth estimation were implemented (by SNS or GNS), could Neos use it to downgrade the audio stream bitrate, and maybe the pose stream frequency, to allow for eventual sync for clients with very limited bandwidth?

@iamgreaser

GreaseMonkey here.

The server we were getting desynced on is the main Creator Jam hub server which is hosted in Ukraine. I get a ~305 msec ping to it.

For reference, about 8 years ago I was getting pings of about 200 msec to US West, 260 msec to US East, and 350 msec to Europe. Nowadays I get about 180 msec to US West and I haven't remeasured the rest.

I think I'm hitting the LNL Relay connection in several cases, even though I know that my NAT handles UDP holepunching just fine.

Connection speed is... pretty decent? I suspect it does boil down to latency.

@Lexevolution

Whenever I had any major desync on some sessions and switched to SNS, I sometimes got this strange side effect where everything seemed like it was synced, but in reality all of my actions, my voice and my perspective of the world were 30+ seconds behind. And to everyone else, I was lagging behind their conversations/actions by 30+ seconds. It looked very different to the regular desyncing issue, which doesn't usually affect voices.

@Frooxius
Collaborator

@Enverex Don't know, we'd need to gather data on that. Does Steam Networking Sockets make a difference? Could be lots of things with the connection.

@shadowpanther That's unlikely, typically that detail is hidden in the protocol itself and not accessible. Generally those don't pose much of an issue anyways, since those are using streams and can be lost. It's the reliable changes that start queuing up. Usually it's not even the bandwidth itself, but rather packet rate.

@iamgreaser Sometimes it's the combination of connections that just don't work with UDP holepunching. We've had cases where you could have connection A, B and C. A and B would work fine, B and C would work fine with each other, but A and C will always go through the relay.

@Lexevolution Steam Networking Sockets handles things differently, which is why you get that effect. Essentially you hit the fixed bandwidth limit, so everything starts trickling through and delaying like that. It's why we need bandwidth estimation to be implemented in the protocol before we can switch to it as the primary one.

@sveken

sveken commented Oct 2, 2021

Is there any roadmap or really rough date for when we could see that happening? Unfortunately the desync issue for Australians, which I only experience on Neos, pretty much locks us out of things like the MetaMovie (which was a ton of fun when it worked) and other big events.

@Frooxius
Collaborator

Frooxius commented Oct 3, 2021

@sveken We asked about bandwidth estimation here: ValveSoftware/GameNetworkingSockets#108 But currently there's no ETA on their end, so we just have to wait and see before the switch. I saw some movement on bandwidth estimation a few months ago.

There are a few things that I'm looking into on our end though that might help improve the network performance, mainly with combining smaller messages into a bigger one to reduce the overall packet-rate.
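
To illustrate the general idea of combining smaller messages into a bigger one, here is a minimal sketch of length-prefixed batching. It is not how Neos or LiteNetLib implement it; the function names, the 2-byte length prefix and the 1200-byte payload limit are all hypothetical choices made for this example.

```python
from typing import Iterable, List


def batch_messages(messages: Iterable[bytes], max_payload: int = 1200) -> List[bytes]:
    """Pack small messages into as few packets as possible.

    Each message is framed with a 2-byte big-endian length prefix.
    max_payload is an assumed per-packet size limit in bytes.
    """
    packets: List[bytes] = []
    current = bytearray()
    for msg in messages:
        if len(msg) + 2 > max_payload:
            raise ValueError("message too large for a single packet")
        framed = len(msg).to_bytes(2, "big") + msg
        # Flush the current packet if this message would overflow it.
        if current and len(current) + len(framed) > max_payload:
            packets.append(bytes(current))
            current = bytearray()
        current += framed
    if current:
        packets.append(bytes(current))
    return packets


def unbatch(packet: bytes) -> List[bytes]:
    """Split a batched packet back into the original messages."""
    out: List[bytes] = []
    i = 0
    while i < len(packet):
        n = int.from_bytes(packet[i:i + 2], "big")
        out.append(packet[i + 2:i + 2 + n])
        i += 2 + n
    return out


if __name__ == "__main__":
    small = [bytes([k % 256]) * 20 for k in range(100)]
    packets = batch_messages(small)
    print(f"{len(small)} messages sent as {len(packets)} packets")
```

The byte count goes up slightly because of the length prefixes, but the packet count drops sharply, which is the quantity the comment above is concerned with.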

@Frooxius
Collaborator

I pushed an upgraded LNL library in 2021.10.25.1351, which should have a number of improvements that should help with this.

Can you give that a go and see if it's any better please?

@sveken

sveken commented Oct 25, 2021

Will give it a little test tonight, and I have booked another Metamovie ticket for the 30th to test there, as that is where I ran into the most desync issues with all the cool things that go on. Will report back afterwards.

Just to confirm, am I best to disable "Prefer Steam Sockets" now with the new update?

@Stellanora64
Author

Stellanora64 commented Oct 25, 2021

I'll do the same this evening, and I'll see how it goes with the creatorjam this weekend as it is really consistent with desync there.

@sveken yes. Disable preferring steam sockets as the updated libraries only affect LNL networks. (from what I understand)

@sikirebirth

To continue/answer the questions asked in the other issue,

  • Queued packets were seen using the user list thing in essentials and other tools that other people made that show queued packets.
  • I'm unsure about the active protocol question. Assuming it's about whether it's the Steam network or not, it's been left unchanged, so non-Steam.
  • I'll generate more logs and attach them here next time I jump on Neos.

@Frooxius
Collaborator

@sikirebirth If you can get a screenshot of the user list in the essentials, that can help! Did you check the queued packets yourself on your end or did the host check?

Ideally if you can get the host to check what it says for you that can help, because the value you're seeing won't be quite up to date, due to the data model being delayed.

@sikirebirth

The queued packets were always checked by hosts of the worlds I was in, and by other people in the world.
I am unsure when I can jump back on Neos this week, but I will post the requested things whenever I do.

@Frooxius
Collaborator

Sweet thanks for the info! We'll see if we get some more data from others in the meanwhile too.

@BigRedWolfy

This weekend I'd like to see how the networking changes affect both the DeSync and ReSync sessions. Another temporary way to limit the issue, which I'll eventually get around to, is splitting both sessions into multiple smaller sessions with tighter controls on the number of people in each (around 5, maybe more). This would allow many more people to be connected to the headless, spread over multiple worlds, with items that may be present in more than one session kept synchronised, similar to VBLFC using nested sessions. Currently this is much preferable to organising a private intercontinental network bridge between the two headless sessions currently running, since such an arrangement would be very expensive.

@sveken

sveken commented Oct 27, 2021

I have only done limited testing so far (still waiting for the weekend).
I do think there is definitely an improvement: I was able to interact normally in a world with 17 people perfectly fine. However, as soon as the 18th person joined I noticed the queued packets quickly starting to climb and then stabilize at 24,000, as reported by the user list thingy in the world.

As soon as the user count dropped to 17 or lower, the queued packets rapidly started to drop and went back to 0.
Neos was only using 1.5Mbits/s of my bandwidth; this was a headless server world. Will report back with how Metamovie goes.

@Frooxius
Collaborator

@sveken Thanks for the info!

  • Who was hosting this session?
  • What is your upload/download?
  • Did anyone else experience queued packets in the session?
  • Is the 1.5 Mbit/s of your bandwidth up or down?

It almost sounds like it hit some bandwidth limit on the host.

@BigRedWolfy Limiting the session can definitely help as it lowers the overall amount of bandwidth, but we'd like to make sure there can be as many people as possible.

There's still an aspect that the updated LiteNetLib doesn't help much with: it uses the sliding window algorithm, which doesn't scale super well with large latency. The Game Networking Sockets should work much better in this regard, but we're still waiting on bandwidth estimation.
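
For readers unfamiliar with why a sliding window scales poorly with latency: with a fixed window, at most one window's worth of reliable packets can be in flight per round trip, so the ceiling on reliable packet rate is roughly window size divided by round-trip time. The sketch below shows that arithmetic. The window size of 64 is a hypothetical value picked for illustration, not a figure confirmed in this thread, and the function names are made up for this example.

```python
def rtt_from_one_way_ms(one_way_ms: float) -> float:
    """Convert a one-way latency in milliseconds to a round-trip time in seconds."""
    return 2.0 * one_way_ms / 1000.0


def max_reliable_packets_per_second(window_size: int, rtt_seconds: float) -> float:
    """Upper bound on reliable packets per second for a fixed sliding window.

    The sender may have at most window_size unacknowledged packets in flight,
    and each acknowledgement takes about one round trip to arrive.
    """
    if window_size <= 0 or rtt_seconds <= 0:
        raise ValueError("window_size and rtt_seconds must be positive")
    return window_size / rtt_seconds


if __name__ == "__main__":
    window = 64  # hypothetical window size, for illustration only
    for one_way in (15, 50, 150):
        rtt = rtt_from_one_way_ms(one_way)
        rate = max_reliable_packets_per_second(window, rtt)
        print(f"one-way {one_way} ms (RTT {rtt * 1000:.0f} ms): at most {rate:.0f} packets/s")
```

Under these assumptions the ceiling falls in direct proportion to latency, regardless of how much bandwidth is available on either end.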

@sveken

sveken commented Oct 27, 2021

I can't remember, unfortunately. Is there a previously visited section I can check for you?
My download/upload is 30/30Mbits/s over 4G.
I was the only one experiencing large amounts of consistent queued packets; some other users would get a brief 300-800 but it would disappear.
The 1.5Mbits/s is just what was reported by Task Manager. I can monitor the specific up/down on the router next time if that helps.

@Frooxius
Collaborator

Thanks for the info! Do you at least know where they were located and what the ping was?

@Stellanora64
Author

Stellanora64 commented Oct 28, 2021

After some testing, desync is still occurring (but only at around 16 to 18+ users in the session), and once the session reaches a certain amount of network traffic the queued packets increase by around 100 packets every 2 to 3 seconds without stopping, unless the player count decreases.

But the updated libraries have certainly helped, as I only start getting desync once the player count is around 16 to 18 users, which is nearly 10 more users than before the update, and the queued packets catch up relatively quickly, taking only around 30 to 40 seconds until I'm fully synced once the player count is below the threshold.

The session had ~140 ping when testing.

Here are the logs from the session. Most of the desync occurred at 6:28:39 PM.075

DESKTOP-F65B6FQ - 2021.10.26.9 - 2021-10-28 18_03_48.log
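
The threshold behaviour described in this comment (backlog growing steadily above a certain traffic level, then draining once the player count drops) is what a simple queue model predicts: the backlog only grows while packets arrive faster than they can be delivered. The sketch below simulates that in one-second steps. All rates are hypothetical numbers chosen to resemble the figures in this thread, not measurements.

```python
from typing import List


def simulate_backlog(arrival_pps: float, service_pps: float,
                     seconds: int, start: float = 0.0) -> List[float]:
    """Return the queued-packet backlog after each simulated second.

    arrival_pps: assumed rate at which reliable packets are produced.
    service_pps: assumed rate at which the link can deliver them.
    """
    backlog = start
    history: List[float] = []
    for _ in range(seconds):
        # The backlog changes by the difference in rates, never below zero.
        backlog = max(0.0, backlog + arrival_pps - service_pps)
        history.append(backlog)
    return history


if __name__ == "__main__":
    # Above the threshold: 40 packets/s more arrive than can be delivered.
    growing = simulate_backlog(arrival_pps=240, service_pps=200, seconds=10)
    print("after 10 s over the threshold:", growing[-1])
    # Below the threshold: the same backlog drains at 50 packets/s.
    draining = simulate_backlog(arrival_pps=150, service_pps=200,
                                seconds=10, start=growing[-1])
    print("seconds until fully synced:", draining.index(0.0) + 1)
```

A net surplus of 40 packets per second corresponds roughly to the "around 100 packets every 2 to 3 seconds" growth reported above; the model says nothing about why a particular session hits its threshold at a given user count.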

@sveken

sveken commented Oct 30, 2021

Just did the MetaMovie again.
The start was much better, with only what felt like 5 seconds of desync compared to 5 minutes before the update.
However, further along in the story the desync did get worse as more things happened; I was told my high score was 50K queued packets.
Bandwidth did not go over 1.3mb/s on the download for Neos.
The world had 14 people in it.

Definitely an improvement however.

@rabbuttz

rabbuttz commented Nov 2, 2021

I don't know if my issue is related to this, but I have recently been experiencing out-of-sync issues in certain sessions. This has been happening more frequently since the recent update regarding the network came in. Specifically, item locations, Logix processing, etc. are out of sync. Strangely enough, the voice and user locations seem to be perfectly in sync, and when I check the QueuedPackets in Neos, it shows 0. Sometimes I can't see the user even though they are supposed to be there. When I go to the dash menu and look at the session details, it looks like the user is not there. This problem did not seem to be related to whether or not I was using SteamNetworkingSockets. My internet speed is 368.0Mbps↓/29.5Mbps↑. The logs are attached. The problem occurs when I'm in a session called "SLOT開発室", around 3:00.
https://www.dropbox.com/s/69oi5kvwi3tvj5f/DESKTOP-CB6HJR0%20-%202021.10.30.625%20-%202021-11-01%2020_12_06.log?dl=0

@Nutcake

Nutcake commented Feb 7, 2022

Hello, I've also been experiencing this issue the past two weeks and just wanted to add another datapoint.

I've been trying to play on a headless server located in the US (Washington to be specific) and I am located in Germany with a 500 MBit/s down, 50 MBit/s up DOCSIS 3.1 based connection. My latency to the server according to the userlist is around 60-80ms (tho I have been told that number is just a single direction, so double that for RTT I guess?). I can play fine with around 6-8 people on the server but more than that and I quickly start getting a massive packet-queue in the hundred-thousands and rising and extreme desync. The connection is an LNL connection and I've tried both directly connecting via IP and by joining the session through my contacts-list which seems to use NAT-Punchthrough.

I've also tried to use German and US-based servers from a high speed VPN service to connect just as an experiment and that made no difference.

Another user from Germany has experienced the same issue in that server and I know that we both use the same ISP provided router (Arris TG3442DE), so since I wanted to get a better router anyways I'm going to buy a new one soon and check if that has any influence on this issue.

I hope this can be resolved soon, as this issue completely prevents me from taking part in the weekend events my community is hosting.

Edit: An update to this: we figured out that the world the headless server was using had a clock in it with extremely unoptimized logix that was spamming network packets every tick. Removing that clock seems to have fixed the issue for me entirely, though we were "only" able to test it with around 16 people.
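
To show why a clock that writes its value every tick is so costly, the sketch below counts how many updates a clock would send if it writes on every tick versus only when the displayed second changes. This is a plain Python model, not Logix; the function name and the 60 ticks per second rate are assumptions for illustration.

```python
def count_updates(tick_rate_hz: int, seconds: int, only_on_change: bool) -> int:
    """Count network updates a clock sends over a period.

    tick_rate_hz is an assumed update rate of the world (ticks per second).
    With only_on_change=True an update is sent only when the displayed
    whole-second value differs from the last one sent.
    """
    sent = 0
    last_shown = None
    for tick in range(tick_rate_hz * seconds):
        shown = tick // tick_rate_hz  # whole seconds the clock would display
        if only_on_change and shown == last_shown:
            continue  # value unchanged, nothing to send
        last_shown = shown
        sent += 1
    return sent


if __name__ == "__main__":
    every_tick = count_updates(60, 10, only_on_change=False)
    on_change = count_updates(60, 10, only_on_change=True)
    print(f"every tick: {every_tick} updates, on change only: {on_change} updates")
```

In this model the change-driven clock sends one update per second instead of one per tick, and each of those updates would be a reliable data model change of the kind that queues up for high-latency users.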
