Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix random crashes on client disconnect #90

Open
wants to merge 9 commits into
base: master
from

Conversation

@only-a-ptr
Copy link
Member

commented Mar 26, 2019

I changed a few things that seemed dangerous to me, but I can't guarantee it will help.

UPDATE: I found an outrageous multithreading bug in disconnect logic (missing lock). This should fix the crashes for good.
Fixes #87

@RigsOfRods RigsOfRods deleted a comment from RoRBot Mar 26, 2019

@RigsOfRods RigsOfRods deleted a comment from RoRBot Mar 26, 2019

only-a-ptr added 4 commits Mar 26, 2019
🔧 More sensible client socket management
classes Receiver and Broadcaster (instantiated per Client) got updated pattern:
 * Keep Client* pointer instead of separate SWInetSocket* and int (user ID)
 * Stopping the thread is now done via std::atomic<bool> rather than naked bool.
 * Thread loop is now `while(true)`, stopping is done at single point using `break`.

Broadcaster+Webserver got updated as a consequence - class Client can no longer be trivially copied due to presence of std::atomic<>

@only-a-ptr only-a-ptr force-pushed the only-a-ptr:devel branch from 7427731 to 06b7b29 Mar 26, 2019

@tritonas00

This comment has been minimized.

Copy link
Contributor

commented Mar 27, 2019

This PR makes the client crash on back to menu or exit game

[RoR|Networking] Disconnect() disconnecting...
[RoR|Networking] SendThread stopped
[RoR|Networking] Disconnect() sender thread stopped...
[RoR|Networking] NetFatalError(): disconnected: remote side closed the connection
[RoR|Networking] Disconnect() receiver thread stopped...
SWBaseSocket::waitIO() timeout!
@CuriousMike56

This comment has been minimized.

Copy link
Collaborator

commented Mar 27, 2019

📐 Tidy up statistics
Also fixes wrong format string in logging.

@only-a-ptr only-a-ptr changed the title Attempt to fix random crashes on client disconnect Fix random crashes on client disconnect Mar 29, 2019

🐛 Added missing lock
Also added a comment where lock must not be

@only-a-ptr only-a-ptr force-pushed the only-a-ptr:devel branch from 4f45db2 to 55078d6 Mar 30, 2019

@RoRBot

This comment has been minimized.

Copy link

commented Apr 1, 2019

🐛 Made clients-mutex recursive
Quick and dirty attempt to solve all possible deadlocks caused by 'clients_mutex'. Problem at hand: QueueClientForDisconnect() gets called both externally (must lock) and internally (must not lock). However, there may be multiple other scenarios due to the presence of scripting.

@only-a-ptr only-a-ptr force-pushed the only-a-ptr:devel branch from ad465d7 to 71dc697 Apr 1, 2019

@CuriousMike56

This comment has been minimized.

Copy link
Collaborator

commented Apr 6, 2019

Test server currently has 114 total disconnects, no crashes (yet).
But RoR still crashes on disconnect:

11:58:32: [RoR|Networking] Disconnect() disconnecting...
11:58:32: [RoR|Networking] SendThread stopped
11:58:32: [RoR|Networking] Disconnect() sender thread stopped...
11:58:32: [RoR|Networking] NetFatalError(): disconnected: remote side closed the connection
11:58:32: [RoR|Networking] Disconnect() receiver thread stopped...

I'm guessing it's an issue with RoR itself as the server seems to correctly disconnect the client. (the crash statistic doesn't update):

06-04-2019 11:53:29|t227| INFO| User disconnects on request: CuriousMike
06-04-2019 11:53:29|t227| INFO| ScriptEngine: success: removed a 'frameStep' callback: void chatPlugin_ping::frameStep(float)
06-04-2019 11:53:29|t227|VERBO| Disconnecting client ID 113: disconnected on request
06-04-2019 11:53:29|t227| INFO| crash statistic: 31 of 113 deletes crashed
...
...
06-04-2019 11:55:18|t229| INFO| User disconnects on request: CuriousMike
06-04-2019 11:55:18|t229| INFO| ScriptEngine: success: removed a 'frameStep' callback: void chatPlugin_ping::frameStep(float)
06-04-2019 11:55:18|t229|VERBO| Disconnecting client ID 114: disconnected on request
06-04-2019 11:55:18|t229| INFO| crash statistic: 31 of 114 deletes crashed
@tritonas00

This comment has been minimized.

Copy link
Contributor

commented Apr 7, 2019

RoR crashes on exit because of this commit c130297

it seems server crashes are gone for good now

@only-a-ptr only-a-ptr self-assigned this Apr 8, 2019

@tritonas00

This comment has been minimized.

Copy link
Contributor

commented Jun 2, 2019

@only-a-ptr

c130297#diff-f576d98e2e496e5bc6fce42726493c9cL76

Changing while (m_msg_queue.empty()) { to while (m_msg_queue.empty() && m_keep_running) {
fixes RoR crash on disconnect

queue_entry_t msg;
// define a new scope and use a scope lock
{
MutexLocker scoped_lock(m_queue_mutex);
while (m_msg_queue.empty() && m_is_running) {
while (m_msg_queue.empty()) {

This comment has been minimized.

Copy link
@AnotherFoxGuy

AnotherFoxGuy Jun 2, 2019

Member
Suggested change
while (m_msg_queue.empty()) {
while (m_msg_queue.empty() && m_keep_running) {

#90 (comment)

@CuriousMike56

This comment has been minimized.

Copy link
Collaborator

commented Jun 10, 2019

There still is a possibility of a crash:
image

@only-a-ptr

This comment has been minimized.

Copy link
Member Author

commented Jun 10, 2019

@CuriousMike56 Are you sure that crash comes from this PR? Because it would make sense in upstream, but it doesn't make sense here. EDIT: Eh, it doesn't make sense either way, I'll just add a bit of superstituous code.

@tritonas00 @AnotherFoxGuy OK, I'll restore that code bit, it's awful but if it works, then it works. EDIT: I'll make one last attempt to be clever about it, if it fails, I'll just apply your suggestion and be done.

🐛 Fixed Broadcaster shutdown
The wait loop was missing a `m_keep_running` check which is needed for clean shutdown.

@only-a-ptr only-a-ptr force-pushed the only-a-ptr:devel branch from f1b72c7 to 671c751 Jun 11, 2019

@tritonas00

This comment has been minimized.

Copy link
Contributor

commented Jun 11, 2019

@CuriousMike56 test this please

@CuriousMike56

This comment has been minimized.

Copy link
Collaborator

commented Jun 16, 2019

image

@only-a-ptr

This comment has been minimized.

Copy link
Member Author

commented Jun 17, 2019

Interesting. Last time the crash happened in std::deque::pop_front(), which may be legitimate if the std::deque is empty at the time. This time it happens in std::deque::push_back() which is never legitimate. We have a memory (heap, to be specific) corruption problem.

@CuriousMike56 If you can, try running the rorserver with valgrind+memcheck: http://cs.ecs.baylor.edu/~donahoo/tools/valgrind/ - I'll try with Windows tools: https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/gflags-and-pageheap

@tritonas00

This comment has been minimized.

Copy link
Contributor

commented Jun 18, 2019

It works fine and also on the first client connect it says:

valgrind --tool=memcheck --track-origins=yes bin/rorserver -lan -port 12000 -fg

18-06-2019 16:06:49|t02| INFO|New client: Archlinux (en_gr), using RoR 0.4.8.0-dev-db39d081-dirt, with token DA39A3EE5E6B4B0D3255BFEF95601890AFD80709
==20993== Thread 4:
==20993== Conditional jump or move depends on uninitialised value(s)
==20993==    at 0x16E15B: Sequencer::queueMessage(int, int, unsigned int, char*, unsigned int) (sequencer.cpp:735)
==20993==    by 0x169358: Receiver::Thread() (receiver.cpp:96)
==20993==    by 0x169428: LaunchReceiverThread(void*) (receiver.cpp:31)
==20993==    by 0x4892A91: start_thread (in /usr/lib/libpthread-2.29.so)
==20993==    by 0x4C9DCD2: clone (in /usr/lib/libc-2.29.so)
==20993==  Uninitialised value was created by a heap allocation
==20993==    at 0x4838DEF: operator new(unsigned long) (vg_replace_malloc.c:334)
==20993==    by 0x16D5CA: Sequencer::createClient(SWInetSocket*, RoRnet::UserInfo) (sequencer.cpp:258)
==20993==    by 0x166249: Listener::threadstart() (listener.cpp:230)
==20993==    by 0x166D38: s_lsthreadstart(void*) (listener.cpp:44)
==20993==    by 0x4892A91: start_thread (in /usr/lib/libpthread-2.29.so)
==20993==    by 0x4C9DCD2: clone (in /usr/lib/libc-2.29.so)
==20993==

@only-a-ptr uninitialised pointer somewhere?

@tritonas00

This comment has been minimized.

Copy link
Contributor

commented Jun 19, 2019

@tritonas00 @AnotherFoxGuy OK, I'll restore that code bit, it's awful but if it works, then it works. EDIT: I'll make one last attempt to be clever about it, if it fails, I'll just apply your suggestion and be done.

RoR still crashes sometimes though (also with my solution) 😒

If you go now to nhelens server and return to menu RoR crashes. Restarting the server probably will fix it but it shouldn't happen in the first place

@tritonas00

This comment has been minimized.

Copy link
Contributor

commented Jun 24, 2019

@only-a-ptr
alt text

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.