Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Onserialize improvements #302

Merged
merged 6 commits into from
Feb 1, 2019
Merged

Onserialize improvements #302

merged 6 commits into from
Feb 1, 2019

Conversation

miwarnec
Copy link
Collaborator

did some more tests. I can get it down to 0.6MB GC and time closer to barriers time.
+100% safe. will never fail.
+no barrier magic
-not as fast as barrier

current:
current

barriers:
barriers

current-optimized:
positionmagic_optimized and no lengthcheck for maxpacketsize

extra: current-optimized + static writer caching:
positionmagic_optimized_cachedwriter

this old UNET bug caused me so much pain, I'd really like to keep this fail safe so that a component can never touch another component's data, no matter what. or do you guys think the few ms there are worth it for barriers?

@paulpach
Copy link
Contributor

While this is good, you need to profile client as well, your patch does nothing to help the client side.

@paulpach
Copy link
Contributor

paulpach commented Jan 16, 2019

Barrier pros:

  1. Barriers eliminate a ton of allocations on the client too
  2. Your patch adds 4 bytes per component overhead, Barriers add 1
  3. If you read less data than available, barriers will detect it 99.6% of the time (or higher). Your patch will not detect it at all. This is the kind of thing you are trying to prevent in the first place. This one thing alone is enough for me to prefer barriers. Nothing worse than a silently ignored bug.
  4. Barriers are a bit faster as shown in your profiler. This is probably because barriers happen to be more cache friendly.
  5. Barriers are safer: With your patch, if a malicious user crafts a message and puts 2GB length for components, they can cause unbound allocations, and make the app crash. This only affects DeSerialization, which happen in the client. This may be a security concern for P2P games, where a malicious user has control over the server.

#302 pros:

  1. It detects too many reads in 100% of the cases. Barriers will detect 99.6% of the cases (probably a bit higher than that)

In my mind barriers might be a tad simpler, but that can be argued either way.

@miwarnec
Copy link
Collaborator Author

miwarnec commented Jan 16, 2019

  1. possible with current version too
  2. true
  3. possible with current one too. didn't matter too much since nothing bad will happen if they don't read until the end. but yea, should add that check just to be sure.
  4. true
  5. true, but at the same time in a P2P game barriers would allow someone to manipulate other components. but not like p2p is ever gonna happen anyway

about the 'Barrier' value: can we use something that is recognizable in hex, like 0xEE?

another thing that comes to mind:
back when I found this bug in UNET, I had if/else cases in On(De)Serialize. so it wouldn't happen all the time, only randomly once per day if my if/else case got hit in On(De)Serialize. if we run an MMO with thousands of players in production and we have one of those on(de)serialize bugs in there, still undetected:

-> barriers: would show error when it happens. everything after that component gets out of sync. this will cause lots of frustration if all of the sudden the monsters start looking like players. not even joking, this happened back when I first encountered this bug. we have to assume that 50% of the users will see the barriers error message and not report it. if they even see it - it's likely that more errors pop up afterwards. this would be a very stressful situation for everyone involved. users can't play after random moments, developers get negative reviews about random bugs, developers have to rush out fixes and stay up all night, etc.

-> current: would show error. error might be ignored. user can still play because everything else still works, except for example the mount component. eventually the developer hears about it and can look into it without being under stress, since people can still play for the most part.

imho there is a huge difference between 'after 1h of playing, everything fails' and 'after 1h of playing I can't use mounts anymore'.

@paulpach
Copy link
Contributor

paulpach commented Jan 16, 2019

possible with current version too

Yes, but keep in mind it makes the code more complicated too.

possible with current one too. didn't matter too much since nothing bad will happen if they don't read until the end

Yes, it is possible to do with current one too. It makes it more complicated though.
if this happens it does matter. If the last field in a component read too little data, well, you have one corrupted field, but if your first field in the component was the one that read too little data, your entire component is corrupted, and there would not be any message in the log to indicate that this happened.

Sure, the current barrier is 0xAB we can display it in hex if it makes life easier.
I chose that value because I examined several binaries and that happen to be one of the least used bytes in real world. Thus it has one of the highest probability of detecting the error. It is out of ascii range, so if you send a lot of text back and forth, this would be even less common.

barriers: would show error when it happens. everything after that component gets out of sync. this will cause lots of frustration if all of the sudden the monsters start looking like players. not even joking, this happened back when I first encountered this bug. we have to assume that 50% of the users will see the barriers error message and not report it. if they even see it - it's likely that more errors pop up afterwards. this would be a very stressful situation for everyone involved. users can't play after random moments, developers get negative reviews about random bugs, developers have to rush out fixes and stay up all night, etc.

Perhaps we should throw and exception and abort the whole synchronization with the first barrier mismatch? Notice the random bug problem is even more likely with the current approach because of the under read problem (point 3). You may have a bug, it goes unnoticed because the current system does not report it, you go to production and people see random behavior due to corrupted reads.

current: would show error. error might be ignored. user can still play because everything else still works, except for example the mount component. eventually the developer hears about it and can look into it without being under stress, since people can still play for the most part.

I don't think trying to recover from corruption is the right approach. Ideally the entire message should be discarded and an error should be logged instead. Either way, it is possible to read corrupted data in both approaches, so there is no way to prevent this 100%

@miwarnec
Copy link
Collaborator Author

under-read problem is fixed now in master.
let's think about this a bit longer until we are sure.

@imerr
Copy link
Contributor

imerr commented Jan 19, 2019

Wouldn't something like this eliminate the extra reader allocation on the client or am I missing some internal workings here that don't allow the reader to be reset?

        internal void OnDeserializeSafely(NetworkBehaviour comp, NetworkReader reader, bool initialState) {
            int startPosition = reader.Position;
            int size = reader.ReadInt32();
            // call OnDeserialize with a temporary reader, so that the
            // original one can't be messed with. we also wrap it in a
            // try-catch block so there's no way to mess up another
            // component's deserialization
            try
            {
                comp.OnDeserialize(reader, initialState);
                if (reader.Position != startPosition + size) 
                {
                    int actualRead = reader.Position - startPosition;
                    if (actualRead > size) 
                    {
                       throw new Exception("Read too much data (should "  + size + ", did " + actualRead);
                    } 
                    else 
                    {
                       throw new Exception("Read too little data (should "  + size + ", did " + actualRead);
                    }
                }
            }
            catch (Exception e)
            {
                reader.Position = startPosition + size;
                // show a detailed error and let the user know what went wrong
                Debug.LogError("OnDeserialize failed for: object=" + name + " component=" + comp.GetType() + " sceneId=" + m_SceneId + " length=" + size+ ". Possible Reasons:\n  * Do " + comp.GetType() + "'s OnSerialize and OnDeserialize calls write the same amount of data(" + size +" bytes)? \n  * Was there an exception in " + comp.GetType() + "'s OnSerialize/OnDeserialize code?\n  * Are the server and client the exact same project?\n  * Maybe this OnDeserialize call was meant for another GameObject? The sceneIds can easily get out of sync if the Hierarchy was modified only in the client OR the server. Try rebuilding both.\n\n" + e.ToString());
            }
        }

@imerr
Copy link
Contributor

imerr commented Jan 19, 2019

This PR currently doesnt work for me, I get deserialization errors in a simple scene with just a NetworkTransform

OnDeserialize failed for: object=NPC(Clone) component=Mirror.NetworkTransform sceneId=0 length=0. Possible Reasons:
  * Do Mirror.NetworkTransform's OnSerialize and OnDeserialize calls write the same amount of data(0 bytes)? 
  * Was there an exception in Mirror.NetworkTransform's OnSerialize/OnDeserialize code?
  * Are the server and client the exact same project?
  * Maybe this OnDeserialize call was meant for another GameObject? The sceneIds can easily get out of sync if the Hierarchy was modified only in the client OR the server. Try rebuilding both.

System.IO.EndOfStreamException: Failed to read past end of stream.
  at System.IO.BinaryReader.FillBuffer (Int32 numBytes) [0x0006c] in /Users/builduser/buildslave/mono/build/mcs/class/corlib/System.IO/BinaryReader.cs:124 
  at System.IO.BinaryReader.ReadSingle () [0x00000] in /Users/builduser/buildslave/mono/build/mcs/class/corlib/System.IO/BinaryReader.cs:506 
  at Mirror.NetworkReader.ReadSingle () [0x00007] in D:\Projects\unity\Mirror\Assets\Mirror\Runtime\NetworkReader.cs:32 
  at Mirror.NetworkReader.ReadVector3 () [0x00002] in D:\Projects\unity\Mirror\Assets\Mirror\Runtime\NetworkReader.cs:135 
  at Mirror.NetworkTransformBase.DeserializeFromReader (Mirror.NetworkReader reader) [0x00009] in D:\Projects\unity\Mirror\Assets\Mirror\Components\NetworkTransformBase.cs:116 
  at Mirror.NetworkTransformBase.OnDeserialize (Mirror.NetworkReader reader, Boolean initialState) [0x00003] in D:\Projects\unity\Mirror\Assets\Mirror\Components\NetworkTransformBase.cs:213 
  at Mirror.NetworkIdentity.OnDeserializeSafely (Mirror.NetworkBehaviour comp, Mirror.NetworkReader reader, Boolean initialState) [0x0007a] in D:\Projects\unity\Mirror\Assets\Mirror\Runtime\NetworkIdentity.cs:561 
UnityEngine.DebugLogHandler:Internal_Log(LogType, String, Object)
UnityEngine.DebugLogHandler:LogFormat(LogType, Object, String, Object[])
UnityEngine.Logger:Log(LogType, Object)
UnityEngine.Debug:LogError(Object)
Mirror.NetworkIdentity:OnDeserializeSafely(NetworkBehaviour, NetworkReader, Boolean) (at D:\Projects\unity\Mirror\Assets\Mirror\Runtime\NetworkIdentity.cs:570)
Mirror.NetworkIdentity:OnDeserializeAllSafely(NetworkBehaviour[], NetworkReader, Boolean) (at D:\Projects\unity\Mirror\Assets\Mirror\Runtime\NetworkIdentity.cs:586)
Mirror.NetworkIdentity:OnUpdateVars(NetworkReader, Boolean) (at D:\Projects\unity\Mirror\Assets\Mirror\Runtime\NetworkIdentity.cs:654)
Mirror.ClientScene:ApplySpawnPayload(NetworkIdentity, Vector3, Quaternion, Byte[], UInt32) (at D:\Projects\unity\Mirror\Assets\Mirror\Runtime\ClientScene.cs:406)
Mirror.ClientScene:OnSpawnPrefab(NetworkMessage) (at D:\Projects\unity\Mirror\Assets\Mirror\Runtime\ClientScene.cs:458)
Mirror.NetworkConnection:HandleBytes(Byte[]) (at D:\Projects\unity\Mirror\Assets\Mirror\Runtime\NetworkConnection.cs:226)
Mirror.NetworkConnection:TransportReceive(Byte[]) (at D:\Projects\unity\Mirror\Assets\Mirror\Runtime\NetworkConnection.cs:276)
Mirror.NetworkClient:Update() (at D:\Projects\unity\Mirror\Assets\Mirror\Runtime\NetworkClient.cs:192)
Mirror.NetworkClient:UpdateClients() (at D:\Projects\unity\Mirror\Assets\Mirror\Runtime\NetworkClient.cs:309)
Mirror.NetworkIdentity:UNetStaticUpdate() (at D:\Projects\unity\Mirror\Assets\Mirror\Runtime\NetworkIdentity.cs:970)
Mirror.NetworkManager:LateUpdate() (at D:\Projects\unity\Mirror\Assets\Mirror\Runtime\NetworkManager.cs:126)
 
(Filename: D:/Projects/unity/Mirror/Assets/Mirror/Runtime/NetworkIdentity.cs Line: 570)

NullReferenceException: Object reference not set to an instance of an object
  at Mirror.NetworkIdentity.OnDeserializeSafely (Mirror.NetworkBehaviour comp, Mirror.NetworkReader reader, Boolean initialState) [0x00121] in D:\Projects\unity\Mirror\Assets\Mirror\Runtime\NetworkIdentity.cs:570 
  at Mirror.NetworkIdentity.OnDeserializeAllSafely (Mirror.NetworkBehaviour[] components, Mirror.NetworkReader reader, Boolean initialState) [0x00029] in D:\Projects\unity\Mirror\Assets\Mirror\Runtime\NetworkIdentity.cs:586 
  at Mirror.NetworkIdentity.OnUpdateVars (Mirror.NetworkReader reader, Boolean initialState) [0x00029] in D:\Projects\unity\Mirror\Assets\Mirror\Runtime\NetworkIdentity.cs:654 
  at Mirror.ClientScene.ApplySpawnPayload (Mirror.NetworkIdentity identity, Vector3 position, Quaternion rotation, System.Byte[] payload, UInt32 netId) [0x00051] in D:\Projects\unity\Mirror\Assets\Mirror\Runtime\ClientScene.cs:406 
  at Mirror.ClientScene.OnSpawnPrefab (Mirror.NetworkMessage netMsg) [0x001f1] in D:\Projects\unity\Mirror\Assets\Mirror\Runtime\ClientScene.cs:458 
  at Mirror.NetworkConnection.HandleBytes (System.Byte[] buffer) [0x0010a] in D:\Projects\unity\Mirror\Assets\Mirror\Runtime\NetworkConnection.cs:226 
  at Mirror.NetworkConnection.TransportReceive (System.Byte[] bytes) [0x00003] in D:\Projects\unity\Mirror\Assets\Mirror\Runtime\NetworkConnection.cs:276 
  at Mirror.NetworkClient.Update () [0x000cc] in D:\Projects\unity\Mirror\Assets\Mirror\Runtime\NetworkClient.cs:192 
  at Mirror.NetworkClient.UpdateClients () [0x0003c] in D:\Projects\unity\Mirror\Assets\Mirror\Runtime\NetworkClient.cs:309 
  at Mirror.NetworkIdentity.UNetStaticUpdate () [0x00006] in D:\Projects\unity\Mirror\Assets\Mirror\Runtime\NetworkIdentity.cs:970 
  at Mirror.NetworkManager.LateUpdate () [0x00000] in D:\Projects\unity\Mirror\Assets\Mirror\Runtime\NetworkManager.cs:126 
 
(Filename: D:/Projects/unity/Mirror/Assets/Mirror/Runtime/NetworkIdentity.cs Line: 570)

It seems to send a lot of 0 length packets? Is that even intended?
I was going to do some profiling, so I made a scene to stress test the networking with some controls which I have attached: Example.zip
It works in master and pauls #299 PR

@paulpach
Copy link
Contributor

@imerr I believe you are being affected by issue #314

There is a very simple workaround, right click on any script, and click reimport.

The proper fix is in #317 , but the workaround is good enough for now.

@imerr
Copy link
Contributor

imerr commented Jan 20, 2019

That didnt change anything, still getting the same error

@imerr
Copy link
Contributor

imerr commented Jan 24, 2019

Ok, lets try this again.
2000 entities moving around randomly in view with a networktransform on them
(the gc.collect call was present in nearly all frames where the networktransforms synced)
Master client: image
Master server: image
#299 client: image
#299 server: image
#302 client: image
#302 server: image

@paulpach
Copy link
Contributor

Something is not right, #302 should have no effect whatsoever on allocations in the client

@imerr
Copy link
Contributor

imerr commented Jan 24, 2019

and another two, #302 with the master merges from @paulpach (thats why the allocations were off)
client:
image
server:
image

@paulpach
Copy link
Contributor

paulpach commented Jan 24, 2019

@vis2k according to imer benchmark, he sees:

In #302 in the client he sees 2.8 MB GC and 16.60 ms
in #299 in the client he sees 1.7 MB GC and 9.9 ms

In my eyes, that is significant.
The difference would be much bigger if you have lots of components per gameobject (modular ummorpg).
Can you do improve performance in this PR?

@imerr
Copy link
Contributor

imerr commented Jan 24, 2019

it's probably a good idea to take out the gc timings when comparing aswell, since they were quite varied.
This gives us:

master #299 #302
client cpu 13ms 7.6ms 12.8ms
client gc 2.8MB 1.7MB 2.8MB
server cpu 26.4ms 21.1ms 25.4ms
server gc 5.8MB 4.5MB 4.5MB

@miwarnec
Copy link
Collaborator Author

wasn't worried about client gc yet. I am about to test server for this one with 500 CCU again in a few minutes.

@miwarnec
Copy link
Collaborator Author

480 CCU runs fine with this version. no noticeable GC spikes.
what do you guys think should we merge this for now? it's kind of a no-brainer since it's as safe, but faster.

we can still talk about and benchmark ondeserialize, barriers, etc. afterwards

@MichalPetryka
Copy link
Contributor

If this is working fine, merge this first and think about barriers etc. later

@paulpach
Copy link
Contributor

sure, merge it, it is better than master.

@paulpach
Copy link
Contributor

@imerr can you run your benchmark against 2018 branch? It has some more optimizations that should help.

@imerr
Copy link
Contributor

imerr commented Jan 31, 2019

Server:
image
Client:
image

@paulpach
Copy link
Contributor

paulpach commented Jan 31, 2019

@imerr if I am reading this right, these are the new results?

master #299 #302 2018
client cpu 13ms 7.6ms 12.8ms 14.12ms
client gc 2.8MB 1.7MB 2.8MB 1.9MB
server cpu 26.4ms 21.1ms 25.4ms 20.43ms
server gc 5.8MB 4.5MB 4.5MB 0.7MB

Is that correct?

@imerr
Copy link
Contributor

imerr commented Jan 31, 2019

Yep
Not sure what the telepathy transport is doing on the client in the 2018 branch, but I guess that's a different topic ;)

@paulpach
Copy link
Contributor

It is the exact same telepathy out of the box, that allocation should be happening in both branches.

@imerr
Copy link
Contributor

imerr commented Jan 31, 2019

Here's current master for comparison:
Server: image
Client:
image

@paulpach
Copy link
Contributor

paulpach commented Jan 31, 2019

ahhhhhh, ok, what is happening is that allocations were previously reported in the LateUpdate, now they are reported under TelepathyTransport.Update.

Those are the same allocations.

If it is not too much work, might be worth redoing some of the tests. Now with the component based transports allocations will show under TelepathyTransport.update.

There is also something suspicious, the 2018 branch client should be as fast as #299 or even a little faster.

@imerr
Copy link
Contributor

imerr commented Jan 31, 2019

We should probably just merge this if we're all happy with it anyways, it's an improvement compared to master
We'll also have a better baseline for further improvements too

@miwarnec miwarnec force-pushed the master branch 2 times, most recently from 8958b63 to e690360 Compare February 1, 2019 12:14
@miwarnec
Copy link
Collaborator Author

miwarnec commented Feb 1, 2019

(webhook test)

@miwarnec miwarnec merged commit dbbdcd0 into master Feb 1, 2019
paulpach added a commit to paulpach/Mirror that referenced this pull request Feb 1, 2019
@paulpach paulpach deleted the onserialize_improvements branch February 1, 2019 12:40
@miwarnec
Copy link
Collaborator Author

🎉 This PR is included in version 1.0.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

tomkrikorian pushed a commit to realcast/Mirror that referenced this pull request Dec 12, 2019
github-actions bot referenced this pull request in MirageNet/Mirage Jan 3, 2020
### Bug Fixes

* it is not safe to modify this outside this class ([bc7a961](bc7a961))
* list server logs properly when disconnected ([f02d317](f02d317))
* Make assembly definition 2018.4 compatible ([99ecc9e](99ecc9e))
* move listserver classes into package ([2668b17](2668b17))
* move NetworkStreamExtension in a namespace ([12de543](12de543))
* null reference exception ([7ce95c5](7ce95c5))
* potential null reference exception with debug logging ([33493a0](33493a0))

### Features

* Add UPM configuration ([#11](#11)) ([9280af1](9280af1))
* network writer pool to avoid expensive allocations ([3659acb](3659acb))
* update upm package if tests pass ([#21](#21)) ([7447776](7447776))

### Performance Improvements

* Optimize interest management ([f1ceb0c](f1ceb0c))
* reuse the network writer used for rpc parameters ([5dafc4d](5dafc4d))

### Reverts

* Revert "feat: Custom readers and writers" ([07ef8c9](07ef8c9))
* Revert "Onserialize improvements (#302)" ([00a3610](00a3610)), closes [#302](#302)
* Revert "Documented the attributes." ([9e3dcc7](9e3dcc7))
* Revert "Documented NetworkBehaviour and NetworkIdentity." ([a5cfc81](a5cfc81))
* Revert "Documented NetworkManager." ([5bc44a9](5bc44a9))
github-actions bot referenced this pull request in MirageNet/Mirage Jan 12, 2020
* position magic instead of allocating writers

* premature optimizations here we come

* simplify. less magic.

* better logging

* fix ondeserialize bugs caused by readpacketuint32 instead of readint32
github-actions bot referenced this pull request in MirageNet/Mirage Jan 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants