
Few small bug fixes in the connectors. #5

Closed
delian opened this issue Jan 23, 2014 · 10 comments

Comments

@delian

delian commented Jan 23, 2014

Hello to all,
I am very new to MistServer (I heard of it for the first time a couple of weeks ago), but I want to suggest a few bug fixes.
I have been dealing with issues with the Mist server under "heavy" load (a few hundred connections; this is not heavy load in my experience, but it seems to be very heavy for Mist). It will probably not surprise you to hear that this software is not really designed to scale well. However, that is a different topic of discussion.

During my investigation of the code I found a number of basic bugs that also affect the stability and scalability of the software. They are present in all of the releases I've checked: 1.0, 1.2 and 1.3.

As I am not a contributor and could not find how to report a bug or submit a patch, I am writing to you here.

I have found bugs and issues in the buffer process, the connector process and the RTMP process. However, this message covers the connector only, as I don't have much time to write more.

The connector design is quite inefficient: it forks a process for each new customer connection, and it has two basic bugs associated with this.
First, there is no SIGCHLD handling in the master. So sometimes when a child dies, it becomes a zombie and slowly exhausts the machine's process table (although memory and everything else is freed).
The fix is quite simple.
You just need to add at the top of the file:

 #include <signal.h>

and somewhere near the beginning of the main function:

 signal(SIGCHLD,SIG_IGN);

I did this and it works fine for me.

However, the major problem in every connector is that in the accept -> fork loop, the master does not close the accepted connection after forking a child.
As a result, the master keeps the file descriptor open. It will be copied into the next forked child (fork duplicates all open descriptors), the count grows in the master again, it is copied into the next child, and so on.
The effect is that each new fork becomes slower and slower (all the descriptors have to be duplicated), but mainly that the number of open descriptors per process (and system-wide) keeps growing, because they are never closed. On a Mist server under heavy load (a steady stream of new, short-lived customer connections), the file descriptor limit of the process/system is eventually exhausted (roughly, the maximum lifetime of the process before the failure is constant * log2(total-process-fd-limit)). At some point there are not enough resources to open a new child, and the new fork crashes. The old connections still work perfectly, though.
I can only imagine the pain of users of this software, whose only way to resolve this issue when it happens is to restart MistServer periodically.
However, the resolution is quite simple.
You just need to add close(S.getSocket()) in the master process. Every connector has a while loop with code like this (taken from HTTPProgressive):

Socket::Connection S = server_socket.accept();
if (S.connected()){ //check if the new connection is valid
  pid_t myid = fork();
  if (myid == 0){ //if new child, start MAINHANDLER
    return Connector_HTTP::progressiveConnector(S);
  }else{ //otherwise, do nothing or output debugging text
#if DEBUG >= 5
    fprintf(stderr, "Spawned new process %i for socket %i\n", (int)myid, S.getSocket());
#endif
    close(S.getSocket()); // THIS LINE HERE IS THE BUG FIX!!!
  }
}

You just need to add close(S.getSocket()); where shown, and everything will be fine.
You can easily verify the fix with lsof.
If everything is correct, lsof -p PID-of-the-master-connector should show that the master has only two Unix sockets open, and each child as well.
If you see more sockets (and steady growth of the allocated file descriptors), then the patch has not been applied.
It works perfectly for me and fixes a lot of issues.
I have spent the last three days applying this patch to the 1.0, 1.2 and 1.3 versions of MistServer, so please integrate it into the main code!
If I find the time, I will write to you about a few other small patches.

@Phencys
Contributor

Phencys commented Jan 24, 2014

Hello Delian,

Thanks for the feedback, our project always likes to listen to new people trying out our software.

As for the bugfixes:

  • The SIGCHLD handler is already taken care of. Even though we don't explicitly create any signal handlers in the connector processes, we have our Config lib to thank for this: in lib/config.cpp, the activate() function sets up all the signal handlers appropriately, and it is called in every connector as well as in the main process.
  • The close() call on the accepted socket in the master is indeed something we missed; thanks for pointing it out :) We are currently working hard to get the 1.3.1 patch release up and running, and the fix will be incorporated into it.

Greetings,
Erik

@delian
Author

delian commented Jan 27, 2014

You may already know this, but S.close() does not help because of the unclean shutdown libmist performs on the socket. The result is that the loop blocks and the whole process stops performing. I intentionally used close(S.getSocket()) because it closes the socket correctly without blocking the process, bypassing libmist.
A similar issue exists in the HTTP connector, partially hidden by the fact that it uses threads. As a result it leaves a huge number of half-closed sockets, and even though I haven't been affected personally, I would guess it is also bad for the application's performance. With threads you do not need to close the socket in the master thread, but you still need a clean close at the end of the thread, which is not happening for some reason.

The code you have in mistlib's socket.cpp ...

  if (sock != -1){
    shutdown(sock, SHUT_RDWR);
    errno = EINTR;
    while (::close(sock) != 0 && errno == EINTR){
    }
    sock = -1;
  }

...will block forever (at least on Linux) if the socket still has non-empty buffers (quite common). Also, if you look carefully, you will see that the while loop will keep retrying close() forever while it fails with EINTR. Why not simply replace that code with:

if (sock!=-1) { close(sock); sock = -1; }

... and let the kernel do what you were trying to do? This way your threads will not block, and S.close() will behave correctly.

@Phencys
Contributor

Phencys commented Jan 27, 2014

Hello Delian,

Thanks for the reply.

I am wondering, however, to what extent the shutdown of our sockets would be
considered "unclean". As far as we know, we do everything the way it should
be done -- the same as the Linux kernel headers.

Regards,


@delian
Author

delian commented Jan 27, 2014

Just do a little experiment: debug the HTTPProgressive process with S.close() in place while traffic is being sent to a user whose connection has lower bandwidth, so the output socket buffer is full.
You will see that S.close() blocks. When you investigate, you will see that shutdown(...) followed by ::close() never returns 0: even if shutdown succeeds, close may return an error because the socket is already destroyed, so the loop blocks forever. close + shutdown is a redundant operation. Moreover, shutdown still allows half-closed sessions (even though you request closing both directions via its parameter), and this is what I see in lsof when my Mist server crashes:

MistConnH 1733 root cwd DIR 252,0 4096 655618 /home/mist/mistserver
MistConnH 1733 root rtd DIR 252,0 4096 2 /
...
MistConnH 1733 root 5u IPv6 10190 0t0 TCP *:http-alt (LISTEN)
MistConnH 1733 root 6u unix 0xffff88013788ad80 0t0 11681606 socket
MistConnH 1733 root 7u IPv6 14745133 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:48088 (CLOSE_WAIT)
MistConnH 1733 root 8u IPv6 15386914 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:49406 (CLOSE_WAIT)
MistConnH 1733 root 9u IPv6 14745134 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:48089 (CLOSE_WAIT)
MistConnH 1733 root 10u IPv6 11694990 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:46120 (ESTABLISHED)
MistConnH 1733 root 11u IPv6 13536320 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:47544 (ESTABLISHED)
MistConnH 1733 root 12u IPv6 15332175 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:49332 (CLOSE_WAIT)
MistConnH 1733 root 13u unix 0xffff880107a0d380 0t0 14746778 socket
MistConnH 1733 root 14u IPv6 14745139 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:48090 (CLOSE_WAIT)
MistConnH 1733 root 15u IPv6 14745142 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:48091 (CLOSE_WAIT)
MistConnH 1733 root 16u IPv6 13162442 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:47372 (ESTABLISHED)
MistConnH 1733 root 17u IPv6 15462473 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:49528 (CLOSE_WAIT)
MistConnH 1733 root 18u IPv6 14745145 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:48092 (CLOSE_WAIT)
MistConnH 1733 root 19u IPv6 14992270 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:48694 (CLOSE_WAIT)
MistConnH 1733 root 20u IPv6 15439011 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:49493 (CLOSE_WAIT)
MistConnH 1733 root 21u IPv6 14745147 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:48093 (CLOSE_WAIT)
MistConnH 1733 root 22u IPv6 14777189 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:48267 (CLOSE_WAIT)
MistConnH 1733 root 23u unix 0xffff8801341be080 0t0 11697711 socket
MistConnH 1733 root 24u IPv6 14745582 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:48132 (CLOSE_WAIT)
MistConnH 1733 root 25u IPv6 14745149 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:48094 (CLOSE_WAIT)
MistConnH 1733 root 26u IPv6 14745155 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:48095 (CLOSE_WAIT)
MistConnH 1733 root 27u IPv6 15519771 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:49602 (CLOSE_WAIT)
MistConnH 1733 root 28u IPv6 14745586 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:48133 (CLOSE_WAIT)
...
MistConnH 1733 root 1081u IPv6 15142618 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:49014 (CLOSE_WAIT)
MistConnH 1733 root 1082u IPv6 15142644 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:49015 (CLOSE_WAIT)
MistConnH 1733 root 1083u IPv6 15147136 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:49016 (CLOSE_WAIT)
MistConnH 1733 root 1084u IPv6 15147188 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:49017 (CLOSE_WAIT)
MistConnH 1733 root 1085u IPv6 15147217 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:49018 (CLOSE_WAIT)
....
MistConnH 1733 root 1994u IPv6 15811895 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:49991 (CLOSE_WAIT)
MistConnH 1733 root 1995u IPv6 15812200 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:49992 (CLOSE_WAIT)
MistConnH 1733 root 1996u IPv6 15812575 0t0 TCP 365limed05.visir.is:http-alt->365limed05.visir.is:49993 (CLOSE_WAIT)
....
(note the 2000+ half-closed sockets)

http://beej.us/guide/bgnet/output/html/singlepage/bgnet.html#closedown

@delian
Author

delian commented Jan 27, 2014

I am doing a little experiment: replacing the close() code in mistlib with what I described above (a plain close). I just realized from my logs that my Mist HTTP connector has been blocked because the maximum number of threads was reached; they are never cleaned up because they block in the close() method.

@delian
Author

delian commented Jan 27, 2014

So far the test behaves well: I now have far fewer CLOSE_WAIT sockets, but I still have some. They may occur if a socket is not closed somewhere while the thread exits (which would also keep the thread alive). Eventually that could exhaust the per-process thread quota. So either this was not the reason for the CLOSE_WAIT states, or it was not the only reason. I need to wait about 24 hours to confirm, though.

However, I have another suggestion -
Your DEBUG printouts are not very useful. Not only that, but in 1.2 (and I guess in 1.3) they have not been updated for the new structures, so there are a couple of places where compilation breaks if you compile with DEBUG.

I would suggest you instead of doing

#ifdef DEBUG
fprintf(stderr, "some debug message\n");
#endif

to replace them with a macro, defined in debug.h for example (where you will do your checks).

The macro can have the value like:

#ifdef DEBUG
#define dbg(...) do { fprintf(stderr, "%s:%d: ", __FILE__, __LINE__); \
                      fprintf(stderr, __VA_ARGS__); } while (0)
#else
#define dbg(...) do { } while (0)
#endif

Then you can remove unistd.h and the other includes that are used in every file only for debugging, and replace the #if DEBUG >= xxx and #ifdef DEBUG blocks with just dbg("%s", ...);
This will simplify compilation, let you easily redefine the debug macro rules, and reduce the number of lines in every file. Mainly, though, I will be able to see which process generated a printout and where: to be fair, there are a lot of messages with the same text produced by different processes, and it is really hard to follow a flow with multiple connections.

@Thulinma
Contributor

Funny that you should mention this - I have just finished making exactly this change.
I'm also working on unifying the connector code and cleaning up some left over sockets (there are several other places where a socket or two was forgotten).
You can expect this code to land in the development branch somewhere today or tomorrow, and in the master branch sometime this week.
Running some tests first, before making it all publicly available.

On a different but related topic: calling shutdown before closing is necessary on some systems to signal EOF to the other end of the connection. However, calling shutdown on a socket that is still going to be used is of course horribly wrong - and I'm working on fixing this, right now. Same as above - code will land sometime today or tomorrow.

Thanks again for your continued feedback and comments! If you find anything else that you think isn't behaving as it should, please don't hesitate to let us know. :-)

@Thulinma
Contributor

For reference: the code I spoke about just landed in the development branches of mistserver/mistlib. Make sure to check out and compile both, or it won't work.

Not all debug messages have been updated to the new format yet - and not all of them work at levels above 4 right now. Fixes for that will arrive soon, too.

@delian
Author

delian commented Jan 27, 2014

Ok, I will look at it, I never even thought there may be different branches there :)

@Thulinma
Contributor

We keep the master branch at the currently stable version. The development branch has the latest changes and contains "soon to be stable" code. So, if you want the latest changes you'll want to keep up with the development branches :-)
