Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hang waiting for create_server #39

Closed
jimfulton opened this issue Jul 7, 2016 · 26 comments
Closed

Hang waiting for create_server #39

jimfulton opened this issue Jul 7, 2016 · 26 comments
Labels

Comments

@jimfulton
Copy link

  • uvloop 0.4.33:
  • python 3.5.1, 3.5.2:
  • platform Mac OS X, Ubuntu 16.04/4.4.12-boot2docker:

I need some suggestions for debugging server startup hangs.

Startup code looks like:

 cr = loop.create_server(self.factory, addr[0], addr[1],
                                reuse_address=True, ssl=ssl)
f = asyncio.async(cr, loop=loop)
server = loop.run_until_complete(f)

What's sometimes happening, in the context of the ZEO tests, is that the run_loop_until_complete call above never returns.

This is in a port of ZEO, github.com/zopefoundation/ZEO, to asyncio. The ZEO tests are exhaustive, and I suspect in uvloop's case, exhausting. :) The tests start and stop servers 10s, maybe 100s of times in a test runs that typically lasts a few minutes. If I run a smaller subset of the tests I don't get a hang.

Most tests run the server as a subprocess using multiprocessing. Some tests run the server using threading. It appears that the server isn't exiting, but merely hanging.

Do you have any suggestions for how to debug this? I've tried setting PYTHONASYNCIODEBUG=1 and enabling debug logging, but that isn't yielding any additional information..

@1st1
Copy link
Member

1st1 commented Jul 7, 2016

The general advice is to create a minimal testcase and then debug the code with gdb (or lldb) and print statements ;)

In this particular case, could you please tell me the values of addr and ssl? Does the loop have any other scheduled tasks? Does it run in the main thread?

@1st1
Copy link
Member

1st1 commented Jul 7, 2016

I'd also suggest to add prints before each await in Loop.create_server. That should give us better understanding where exactly the loop hangs.

@1st1
Copy link
Member

1st1 commented Jul 7, 2016

Hm, actually, there is only one await in Loop.create_server, and it waits on getaddrinfo. So it might be a bug in the DNS resolver. What's the value of addr?

@1st1
Copy link
Member

1st1 commented Jul 7, 2016

And there's also an option to make a debug build with make debug. When built in debug mode, loop has a print_debug_info() method which outputs a bunch of useful debug info.

@jimfulton
Copy link
Author

On Thu, Jul 7, 2016 at 5:39 PM, Yury Selivanov notifications@github.com
wrote:

The general advice is to create a minimal testcase and then debug the code
with gdb (or lldb) and print statements ;)

Yeah. I took an initial stab at that, but it was too minimal. I'll take
another stab.

In this particular case, could you please tell me the values of addr and
ssl?

ssl is usually, but not always None. I just hacked out SSL-related tests
and ssl support, and still hanging. So for the sake of this discussion,
SSL is always None.

The address is almost always ('127.0.0.1', 0). I just did some test
cleanup to make this so. :) Previously, it often had a host of
'localhost' or '' and a randomly selected port. There may also be a
test or 2 that uses unix-domain sockets. (If not, I should add one. :) ).

Does the loop have any other scheduled tasks?

No.

Does it run in the main thread?

No. (I don't currently call set_event_loop, because I pass the loop around
and nothing was relying on the default event loop. Previously, I was
calling set_event_loop, and the server was hanging then as well.

I'm 70% sure this is a resource exhaustion problem of some sort.

Also worth noting:

  • I tried just using uvloop for clients and still got a test failure, one
    of which had to do with running the client within a multiprocessing Process.
  • The test servers almost always are run in a multi-Processing process.

Jim

Jim Fulton
http://jimfulton.info

@jimfulton
Copy link
Author

On Thu, Jul 7, 2016 at 6:05 PM, Yury Selivanov notifications@github.com
wrote:

Hm, actually, there is only one await in Loop.create_server, and it waits
on getaddrinfo. So it might be a bug in the DNS resolver. What's the
value of addr?

Almost always ('127.0.0.1', 0)

@jimfulton
Copy link
Author

On Thu, Jul 7, 2016 at 6:45 PM, Yury Selivanov notifications@github.com
wrote:

And there's also an option to make a debug build with make debug. When
built in debug mode, loop has a print_debug_info() method which outputs a
bunch of useful debug info.

OK, I'll try that. Would you suggest print_debug_info() here:

https://github.com/zopefoundation/ZEO/blob/debug-uvloop/src/ZEO/asyncio/server.py#L226

@1st1
Copy link
Member

1st1 commented Jul 8, 2016

OK, I'll try that. Would you suggest print_debug_info() here:

The idea is to create a debug task that will periodically print debug info on the screen. This is how I use it in uvloop benchmarks:
https://github.com/MagicStack/uvloop/blob/master/examples/bench/echoserver.py#L78

I'm thinking about adding a manhole or something, that would let you attach to a running event loop from the outside.

@jimfulton
Copy link
Author

The program was running out of file descriptors. Not an issue with uvloop at all. Sorry for the noise.

@1st1
Copy link
Member

1st1 commented Aug 6, 2016

Thanks a lot for confirming this!

@jimfulton
Copy link
Author

jimfulton commented Aug 7, 2016

I was wrong. It has nothing to do with running out of file descriptors. (I was able to make uvloop fail when having too many open files, but that's not what made the tests fail.)

Many of the ZEO tests are integration tests. Typically, these tests run a server in a subprocess using multiprocessing. A few tests run the server in a thread, which makes it possible to introspect server state. Mixing thread-based and multiprocessing-based tests leads to hanging in the test runner when using uvloop.

Here's a minimal script that demonstrates the problem:

https://gist.github.com/jimfulton/59c02a96ebe4720d512b2b0ee426c7ed

I'm not sure if this issue is specific to multiprocessing (I suspect so because magic :) ) or would also be an issue with other ways of running a server in a subprocess.

@jimfulton jimfulton reopened this Aug 7, 2016
jimfulton pushed a commit to zopefoundation/ZEO that referenced this issue Aug 7, 2016
uvloop is a bit faster than the standard asyncion event loop.

Unfortunately, there's an issue with the ZEO tests and using uvloop in
the ZEO server:

  MagicStack/uvloop#39

Fortunately, most of the performamce benefits of using uvloop seems to
be in the client.

For now, we'll just use ivloop on the client side, as the incremental
effort of using it in the server aren't worth futher wrestling with
the tests. (I spent quite a bit of time just narrowing down the cause
of the test issue.)

With this PR, if uvloop can be imported, it will be used for the
client. (uvloop requires Python 3.5 or later and doesn't support Windows.)
@1st1
Copy link
Member

1st1 commented Aug 8, 2016

Interesting. Thanks a lot for the script -- will investigate tomorrow.

jimfulton pushed a commit to zopefoundation/ZEO that referenced this issue Aug 8, 2016
uvloop is a bit faster than the standard asyncion event loop.

Unfortunately, there's an issue with the ZEO tests and using uvloop in
the ZEO server:

  MagicStack/uvloop#39

Fortunately, most of the performamce benefits of using uvloop seems to
be in the client.

For now, we'll just use ivloop on the client side, as the incremental
effort of using it in the server aren't worth futher wrestling with
the tests. (I spent quite a bit of time just narrowing down the cause
of the test issue.)

With this PR, if uvloop can be imported, it will be used for the
client. (uvloop requires Python 3.5 or later and doesn't support Windows.)
@1st1
Copy link
Member

1st1 commented Aug 8, 2016

Looks like this bug is in libuv. A call to uv_getaddrinfo never calls back the passed callback function. I suspect that there is a lock or something that breaks on fork() + starting a new event loop in the forked process.

Options:

  1. Start using Python's getaddrinfo in a Python threadpool.
  2. Fix the bug in libuv.
  3. Switch to c-ares or getdns (I wanted to do this sometime anyways).

Not sure how to even estimate which option is more preferable here.

cc @saghul

@saghul
Copy link

saghul commented Aug 8, 2016

I suspect that there is a lock or something that breaks on fork() + starting a new event loop in the forked process.

This is the case, yes. The threadpool has some global state which needs to be adjusted in the child. Since uv_getaddrinfo runs on the threadpool it will fail. Basically using any loop before fork() and trying to use them (or a new one) is not currently supported.

There is, however, a pull request which addresses this: libuv/libuv#846 more eyes are always welcome!

Start using Python's getaddrinfo in a Python threadpool.

You might want to do this for another reason though: Python's getaddrinfo is not quite like getaddrinfo(3), it makes some normalizations and other stuff I no longer remember but it might be a good idea to use it, for consistency.

@1st1
Copy link
Member

1st1 commented Aug 8, 2016

I suspect that there is a lock or something that breaks on fork() + starting a new event loop in the forked process.

This is the case, yes. The threadpool has some global state which needs to be adjusted in the child. Since uv_getaddrinfo runs on the threadpool it will fail. Basically using any loop before fork() and trying to use them (or a new one) is not currently supported.

There is, however, a pull request which addresses this: libuv/libuv#846 more eyes are always welcome!

I've seen that PR, it looks good to me (I spent quite a bit of time looking and playing with libuv internals). But I'm not a core libuv committer to say the LGTM :)

Start using Python's getaddrinfo in a Python threadpool.

You might want to do this for another reason though: Python's getaddrinfo is not quite like getaddrinfo(3), it makes some normalizations and other stuff I no longer remember but it might be a good idea to use it, for consistency.

Yes, but I already have all the code to normalize libuv results into what Python code is expecting to see. All that code would be useful for c-ares or getdns though.

@jimfulton How serious is this bug? Does it preclude you from using ZODB with uvloop?

@jimfulton
Copy link
Author

The bug is mainly an inconvenience at this point.

Informal performance measurements indicate that most of the win is on the client, so for the moment, I'm not using uvloop on the server. I could use it on the server if I rearrange some tests, which I think I'll probably do. (If it was easy for you to fix, I'd have waited, but I'll go ahead and work around it.)

(Lack of connect_accepted_socket is also preventing using on the server for the main project I'm working on. :) )

@1st1
Copy link
Member

1st1 commented Aug 8, 2016

I'll try to add connect_accepted_socket to uvloop today :)

@saghul
Copy link

saghul commented Aug 8, 2016

I've seen that PR, it looks good to me (I spent quite a bit of time looking and playing with libuv internals). But I'm not a core libuv committer to say the LGTM :)

If you don't mind, please leave a comment, it helps.

All that code would be useful for c-ares or getdns though.

I'm not sure about that. c-ares does not implement getaddrinfo, and I don't know about getdns. You can make shim using A and AAAA queries, but mimicking the system getaddrinfo will be tricky.

@1st1
Copy link
Member

1st1 commented Aug 9, 2016

@saghul I've left a comment for libuv/libuv#846 -- it solves this particular bug and other bugs we have with os.fork calls. When do you think it can be merged? (I really, really don't want to use patched libuv).

@saghul
Copy link

saghul commented Aug 9, 2016

@1st1 thanks! I can't really give an estimate, it's a (relatively) big change, so we are cautious.

jimfulton pushed a commit to zopefoundation/ZEO that referenced this issue Aug 9, 2016
To accomplish this, it was necessary to rearrange the tests so that
tests that ran servers in threads rather than using multiprocessing
into their own layer.  This is due to a bug currently in uvloop that
prevents using running uvloop servers in a process and in subprocesses
created with multiprocessing:

MagicStack/uvloop#39

To run the tests with uvloop installed, it's necessary to use the
``-j`` option to run layers in separate processes.
@1st1
Copy link
Member

1st1 commented Aug 10, 2016

@jimfulton FWIW uvloop v0.5.1 has loop.connect_accepted_socket()!

@jimfulton
Copy link
Author

Great! I'll try it today.

BTW, I've updated the ZEO tests so they pass with uvloop (when run a certain way that runs groups of tests in separate processes).

@jimfulton
Copy link
Author

connect_accepted_socket works great. :) ZEO is now using it in it's multi-threaded-server configuration. Thanks.

@1st1
Copy link
Member

1st1 commented Aug 11, 2016

Finally some good news ;)

@1st1
Copy link
Member

1st1 commented Nov 13, 2017

Fixed in master. Will be in the next uvloop release (0.9.0) soon.

@1st1 1st1 closed this as completed Nov 13, 2017
@1st1
Copy link
Member

1st1 commented Nov 27, 2017

Please try uvloop v0.9.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants