Skip to content

drivers/unix_socket: Let clients to freely connect/disconnect#613

Merged
sangjinhan merged 12 commits intoNetSys:masterfrom
llvilanova:fix-unix-socket
Aug 24, 2017
Merged

drivers/unix_socket: Let clients to freely connect/disconnect#613
sangjinhan merged 12 commits intoNetSys:masterfrom
llvilanova:fix-unix-socket

Conversation

@llvilanova
Copy link
Copy Markdown
Contributor

In order to allow external controllers (connected through a UNIX socket) truly
independent of bessd, client disconnection must always be detected, regardless
of whether the port is used or not. This is done by keeping a thread
around (using epoll).

Before, this driver would only reliably detect disconnects when the port is used
to receive packets. When used to send packets, the port would only detect
disconnects on an actual send, which depends on the processed packets; this led
to only some instances accepting new clients, making external program
reconnection impossible.

In order to allow external controllers (connected through a UNIX socket) truly
independent of bessd, client disconnection must always be detected, regardless
of whether the port is used or not. This is done by keeping a thread
around (using epoll).

Before, this driver would only reliably detect disconnects when the port is used
to receive packets. When used to send packets, the port would only detect
disconnects on an actual send, which depends on the processed packets; this led
to only some instances accepting new clients, making external program
reconnection impossible.
@llvilanova
Copy link
Copy Markdown
Contributor Author

I'm not sure how this patch affects the failed test. Any ideas?

@sangjinhan
Copy link
Copy Markdown
Member

Seems like the failure happens with 32bit build and it is reproducible. Trying to debug it...

@codecov
Copy link
Copy Markdown

codecov Bot commented Aug 22, 2017

Codecov Report

Merging #613 into master will increase coverage by 0.02%.
The diff coverage is 77.58%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #613      +/-   ##
==========================================
+ Coverage   69.29%   69.32%   +0.02%     
==========================================
  Files         204      204              
  Lines       13037    13061      +24     
==========================================
+ Hits         9034     9054      +20     
- Misses       4003     4007       +4
Impacted Files Coverage Δ
core/drivers/unix_socket.h 100% <100%> (ø) ⬆️
core/drivers/unix_socket.cc 80.91% <77.19%> (-4.14%) ⬇️
core/modules/flowgen.cc 74.07% <0%> (+0.37%) ⬆️
core/dpdk.cc 82.55% <0%> (+4.65%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2ee1973...8cd0de6. Read the comment docs.

@llvilanova
Copy link
Copy Markdown
Contributor Author

Is there some way for me to easily test the failing experiment on that build?

Also, is it possible that DeInit is being called before Init? (I've checked this never happens when Init returns a command error) That could trigger a file descriptor close before properly setting the descriptors (they should all be zero from the constructor).

@llvilanova
Copy link
Copy Markdown
Contributor Author

I just did a cleanup to ensure there's no threads or FDs left behind when a module is destroyed, but it still shows the same failure.

If you give me a command line to reproduce the failing build and specific test I'll give it a shot in the debugger.

@sangjinhan
Copy link
Copy Markdown
Member

I don't know how to reliably reproduce the issue locally. What I did was using Travis with my private fork of this repo. The issue seems to happen only with 32bit builds.

What I realized is that bessctl/conf/testing/module_tests/vlan.py opens lots of (more precisely, 450) UnixSocketPort. Given that each port now opens three (not two) file descriptors, my theory is that it may exceed the file descriptor limit of 1024. For processes launched by root, maybe this limit only applies to 32bit programs. Could you update the file as below?

-        return [VLANSplit(), 1, 150, expected]
+        return [VLANSplit(), 1, 30, expected]

-    OUTPUT_TEST_INPUTS.append(output_test([1, 100, 77, -1, 149, 50, 100, -1]))
-    OUTPUT_TEST_INPUTS.append(output_test([100, 77, -1, 149, 50, 100, -1, 33, 70]))
-    OUTPUT_TEST_INPUTS.append(output_test([100, 77, -1, 149, 50, 100, -1, 33, 70], True))
+    OUTPUT_TEST_INPUTS.append(output_test([1, 17, -1, 29, 10, 13, 7]))
+    OUTPUT_TEST_INPUTS.append(output_test([1, 17, -1, 29, 10, 13, 7], True))

@llvilanova
Copy link
Copy Markdown
Contributor Author

I'm on it.

Lluís Vilanova added 2 commits August 22, 2017 21:25
@changlan changlan requested review from shwang and removed request for shwang August 22, 2017 20:14
@sangjinhan
Copy link
Copy Markdown
Member

Two requests regarding file descriptor usage:

  • Instead of using epoll, could you switch to select or poll?
  • Instead of using pipe, you can terminate the thread with pthread_cancel or pthread_kill
    • Let's have a separate function for the thread, rather than a large lambda block.

@llvilanova
Copy link
Copy Markdown
Contributor Author

Can't use poll because wakeup conditions apply to all FDs (therefore the thread will be woken up too often - i.e., every time there's data on the client socket, not only when it's closed, since we need to wait for available reads on the accept socket).

Can't use select either because socket disconnect can only be detected together with available read events (i.e., will wake up too often when the client socket is readable, not only when it's closed).

So, is it OK to keep it with epoll?

@sangjinhan
Copy link
Copy Markdown
Member

I am not sure if I understood correctly. You can add client_fd_ only to exceptfds for select(), or set it only with POLLRDHUP for poll()? Just like epoll, you should be able to specify what events you want to monitor for individual file descriptors. I vaguely remember that all select/poll/epoll are implemented with the same backend code in the kernel.

Maybe I am missing something here. If you find epoll is the only way to reliably detect disconnection of client_fd_, please go ahead.

Lluís Vilanova added 2 commits August 23, 2017 02:39
Using signals instead of pipes minimizes the number of open file
descriptors (two FDs per instance).
Using ppoll() instead of epoll_pwait() saves us from using one additional file
descriptor per instance, without introducing any measurable performance
overhead.
@llvilanova
Copy link
Copy Markdown
Contributor Author

Ooops, sorry. I read poll too quickly and misunderstood it; yes it's a suitable replacement to epoll_wait.

For the record, the limitations on select still stand: there is no equivalent to EPOLLHUP/POLLHUP in select, and one must instead add a FD into readfds and check if a read returns zero (overkill for this FD monitoring thread).

Lluís Vilanova added 2 commits August 23, 2017 03:10
Only one client connection is supported now, so ignore new ones as long as the
current connection is not severed.
Copy link
Copy Markdown
Member

@sangjinhan sangjinhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for the update. I'll merge once those minor comments are addressed.


recv_skip_cnt_ = 0;
while (true) {
fds[1].fd = client_fd_;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you add a comment that negative file descriptors are ignored by poll() for clafirication?

Comment thread core/drivers/unix_socket.cc Outdated
if (errno == EINTR) {
continue;
} else {
PLOG(ERROR) << "[UnixSocketPort]:epoll_wait()";
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to ignore this: you don't need to add [UnixSocketPort]: to the log message since the filename will be automatically added by PLOG. Also, epoll_wait() -> ppoll()

Comment thread core/drivers/unix_socket.cc Outdated
} else if (fds[1].revents & (POLLRDHUP | POLLHUP)) {
// connection dropped by client
int fd = client_fd_;
client_fd_ = -1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kNotConnectedFd in place of -1

Comment thread core/drivers/unix_socket.cc Outdated
// Relaunch the accept thread.
std::thread accept_thread(AcceptThreadMain, reinterpret_cast<void *>(this));
accept_thread.detach();
static void AcceptThreadHandler(int sig) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: You can just omit the argument name (just static void AcceptThreadHandler(int) {) to indicate the argument will be unused.

Comment thread core/drivers/unix_socket.cc Outdated
}


accept_thread_ = std::thread([&]() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[this] can be used in place of [&].

Comment thread core/drivers/unix_socket.cc Outdated
// ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
// POSSIBILITY OF SUCH DAMAGE.

#include <signal.h>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#include <poll.h> should appear first.

Comment thread core/drivers/unix_socket.h Outdated
#define BESS_DRIVERS_UNIXSOCKET_H_

#include <assert.h>
#include <atomic>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#include <poll.h>

#include <atomic>
#include <cassert>
#include <cerrno>
#include <cstdio>

The Google C++ Style Guide says all C++ headers should appear after C headers. Each group needs to be alphabetically sorted.

Do we need cassert and cstdio? Also, poll.h and cerrno seem only necessary in the .cc file, not in the header.

@sangjinhan sangjinhan merged commit 0b4ffeb into NetSys:master Aug 24, 2017
@sangjinhan
Copy link
Copy Markdown
Member

Merged, thanks!
tumblr_m44vnvr00w1rtuomto1_1280

@llvilanova llvilanova deleted the fix-unix-socket branch August 27, 2017 12:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants