Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sv_parallel_sendsnapshot and sv_parallel_packentities crash servers randomly #5316

Closed
FredyH opened this issue Aug 27, 2022 · 15 comments
Closed
Labels
Crash The issue is a "random" runtime crash.

Comments

@FredyH
Copy link

FredyH commented Aug 27, 2022

Details

sv_parallel_sendsnapshot is a console command that significantly improves performances of srcds at high player counts. In my tests, performance was improved by up to 20-40% at high player counts, presumably since a big portion of each tick is spent just sending data to each client sequentially. When the command is enabled, this work is done in parallel instead.

The leaked source engine code states that these commands are currently broken and should be disabled. However, as I've been told, the crashes appear to be fixed in the CS:GO version of the source engine, so I was wondering if these changes could possibly be backported to gmod due to the big impact they have on server performance.

See also ValveSoftware/Source-1-Games#11 for an old report of this issue.

Steps to reproduce

This is extremely difficult to reproduce but happens roughly once a day at high player counts (100+) for me.

@robotboy655 robotboy655 added the Crash The issue is a "random" runtime crash. label Aug 28, 2022
@FredyH
Copy link
Author

FredyH commented May 6, 2023

Just as an addendum, here is a quick showcase of why the two commands improve performance so significantly. Here is a snapshot of where my server at 128 players spends most of its time without the commands enabled (captured using perf record -g --pid , displayed using perf report, or using a more advanced flame graph tool):
image

As you can see, over 50% of time is spent just sending messages to clients!

@mcNuggets1
Copy link

Thats incredible. The performance impact on networking is something different.
Is your networking by any means unoptimised? Eg. using a lot of broadcast net messages or dt/nw vars across the server?

@FredyH
Copy link
Author

FredyH commented May 12, 2023

Our networking is fairly optimized. We send roughly 20KB/s server-client net traffic (entire server, not per client!), so it has barely an impact. This is compared to ~20KB/s per client that the engine uses on our serve to sync entity data (positions, etc.) which you can see in the net_graph.
You can also see the actual culprits making up the vast majority of the time spent on networking in the flame graph, namely

  • WriteDeltaEntities
  • CheckTransmit
  • PackEntities

All of those are about writing entities and their data, not net or userdata calls.

@RaphaelIT7
Copy link

I found a comment in the leaked TF2 code that could help to solve this problem.

If enabled, random crashes start to appear in WriteTempEntities, etc. It looks like
one thread can be in WriteDeltaEntities while another is in WriteTempEntities, and both are
partying on g_FrameSnapshotManager.m_FrameSnapshots. Bruce sent e-mail from a customer
that stated these crashes don't occur when parallel_sendsnapshot is disabled. Zoid said:

Easiest is just turn off parallel snapshots, it's not much of a win on servers where we
are running many instances anyway. It's off in Dota and CSGO dedicated servers.

Bruce also had a patch to disable this in //ValveGames/staging/game/tf/cfg/unencrypted/print_instance_config.py

@FredyH
Copy link
Author

FredyH commented Jun 8, 2023

The issue was already fixed by Valve in CS:GO and the appropriate fix can be found in the corresponding leaked code (essentially it boils down to obtaining a mutex when accessing the m_FrameSnapshots).
It would not be overly hard to backport that change to Garry's Mod either I believe.

@RaphaelIT7
Copy link

RaphaelIT7 commented Jun 8, 2023

Wouldn't it be possible to write a DLL that detours CGameServer::SendClientMessages and solves this issue.
I don't know if something like that is possible, but would you have any idea?

Btw, here is the leaked CS:GO code

@FredyH
Copy link
Author

FredyH commented Jun 8, 2023

Yes it can be fixed by detouring certain functions and essentially applying Valve's fix. Rumours have it that some people have already done exactly that but it is closed source (I have yet to verify).

@RaphaelIT7
Copy link

RaphaelIT7 commented Jun 8, 2023

CS:GO uses a CThreadFastMutex for the array here I think that is actually the fix.

Edit: If I understand the cause of the crash correctly, one would only need to use a CThreadFastMutex in all functions that are potentially used by another thread.

@RaphaelIT7
Copy link

RaphaelIT7 commented Jun 9, 2023

So after looking into it, it looks like only some lines would have to be added to fix the whole problem. This could be an approach:

engine/framesnapshot.h

class CFrameSnapshotManager
{
[...]
private:
	void	DeleteFrameSnapshot(CFrameSnapshot* pSnapshot);

	CThreadFastMutex					m_FrameSnapshotsWriteMutex; // This was added
	CUtlLinkedList<CFrameSnapshot*, unsigned short>		m_FrameSnapshots;
	CClassMemoryPool< PackedEntity >			m_PackedEntitiesPool;

engine/sv_framesnapshot.cpp

CFrameSnapshot* CFrameSnapshotManager::CreateEmptySnapshot(int tickcount, int maxEntities)
{
	AUTO_LOCK_FM(m_FrameSnapshotsWriteMutex); // This was added
	CFrameSnapshot* snap = new CFrameSnapshot;
	snap->AddReference();
	snap->m_nTickCount = tickcount;
	snap->m_nNumEntities = maxEntities;
	snap->m_nValidEntities = 0;
	snap->m_pValidEntities = NULL;
	snap->m_pHLTVEntityData = NULL;
	snap->m_pReplayEntityData = NULL;
	snap->m_pEntities = new CFrameSnapshotEntry[maxEntities];

	CFrameSnapshotEntry* entry = snap->m_pEntities;

	// clear entries
	for (int i = 0; i < maxEntities; i++)
	{
		entry->m_pClass = NULL;
		entry->m_nSerialNumber = -1;
		entry->m_pPackedData = INVALID_PACKED_ENTITY_HANDLE;
		entry++;
	}

	snap->m_ListIndex = m_FrameSnapshots.AddToTail(snap);
	return snap;
}
void CFrameSnapshotManager::DeleteFrameSnapshot(CFrameSnapshot* pSnapshot)
{
	AUTO_LOCK_FM(m_FrameSnapshotsWriteMutex); // This was added

	// Decrement reference counts of all packed entities
	for (int i = 0; i < pSnapshot->m_nNumEntities; ++i)
	{
		if (pSnapshot->m_pEntities[i].m_pPackedData != INVALID_PACKED_ENTITY_HANDLE)
		{
			RemoveEntityReference(pSnapshot->m_pEntities[i].m_pPackedData);
		}
	}

	m_FrameSnapshots.Remove(pSnapshot->m_ListIndex);
	delete pSnapshot;
}
void CFrameSnapshot::ReleaseReference()
{
	AUTO_LOCK_FM(framesnapshotmanager->m_FrameSnapshotsWriteMutex); // This was added

	Assert(m_nReferences > 0);

	--m_nReferences;
	if (m_nReferences == 0)
	{
		g_FrameSnapshotManager.DeleteFrameSnapshot(this);
	}
}

@robotboy655
Copy link
Contributor

I have applied the supposed fix from CS:GO to dev, please let me know if that causes more isssues or fixes the issue for you.

@RaphaelIT7
Copy link

RaphaelIT7 commented Jun 9, 2023

our server crashed with this error:

[GM:CallWithArgs - !ThreadInMainThread] OnRequestFullUpdate
[GM:CallWithArgs - !ThreadInMainThread] OnRequestFullUpdate
[GM:CallWithArgs - !ThreadInMainThread] OnRequestFullUpdate
[GM:CallWithArgs - !ThreadInMainThread] OnRequestFullUpdate
L 06/09/2023 - 20:22:47: Engine error: [GM:CallWithArgs - !ThreadInMainThread] OnRequestFullUpdate
[GM:CallWithArgs - !ThreadInMainThread] OnRequestFullUpdate

Could it be that in one of the threads OnRequestFullUpdate gets called? And that if
gameevent.Listen("OnRequestFullUpdate") is used, it tries to do something with LUA from another thread?

Edit: After testing again, I found out that removing all gameevent.Listen("OnRequestFullUpdate") in our code solves these crashes.

@robotboy655
Copy link
Contributor

I doubt it is related to the recent changes, I haven't added any new threading.

That being said, I cannot even reproduce it.

@RaphaelIT7
Copy link

RaphaelIT7 commented Jun 9, 2023

After testing a bit, I wasn't able to reproduce the crash.
I thought it was something with gameevent.Listen("OnRequestFullUpdate") because the error indicated that it's trying to call some LUA Stuff with OnRequestFullUpdate and after removing all lines with gameevent.Listen("OnRequestFullUpdate") the server stopped crashing.
(I tried to use cl_fullupdate while two players were on the server because, if I saw correctly, there must be more than one player on the server for the networking to be threaded.)

I also don't think this crash was created by some recent changes, because if I remember correctly, I saw this crash a few months ago when I tested the convar.

But the main problem seems to be fixed because the server stopped crashing silently (which it did while the convar was broken)

@FredyH
Copy link
Author

FredyH commented Jun 11, 2023

I've had it run on my server for a day now without any crashes, so it looks promising.
If it does end up crashing I will post it here.

@RaphaelIT7
Copy link

Same here. It's also possible to activate / deactivate sv_parallel_sendsnapshot while a bunch of users are connected.
Before the fix, a server would just instantly crash, if I remember correctly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Crash The issue is a "random" runtime crash.
Projects
None yet
Development

No branches or pull requests

4 participants