Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow DefaultSslEngineFactory subclass customization of the SslContext #1171

Merged

Conversation

marshallpierce
Copy link

See #1170 for context.

If you have ideas for how to usefully test this, I'm happy to write them up, but it wasn't obvious to me how to usefully test this change.


private SslContext getSslContext(AsyncHttpClientConfig config) throws SSLException {
private SslContext buildSslContext(AsyncHttpClientConfig config) throws SSLException {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left this code where it was in the source but if you don't mind the reshuffling, I'll move it down to the bottom with the other private stuff

@slandelle
Copy link
Contributor

Headache... Actually, it bothers me if we only detect SslContext building crash on first request instead of client instantiation...

@marshallpierce
Copy link
Author

marshallpierce commented Jun 2, 2016

Yeah, that's really not very nice at all. :( If breaking compatibility is out of the question, one thing we could do is add a pre-emptive initialization call in ChannelManager in the typical case where the factory is the default impl (either a custom subclass or not): cast, and then call some method that's not in the interface.

Perhaps at the next opportunity to break compatibility we could add some sort of init() to the factory interface. This would allow us to cleanly throw from such initialization errors, and also remove the double-checked locking and have simply a volatile.

@marshallpierce
Copy link
Author

Oh wait, this project is Java 8 -- couldn't we just add an init(config) with a default no-op impl to the factory interface?

@slandelle
Copy link
Contributor

Oh right! Can you give it a try?

} catch (SSLException e) {
throw new ExceptionInInitializerError(e);
throw new RuntimeException("Could not initialize sslEngineFactory", e);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought this was a more helpful exception

@marshallpierce marshallpierce changed the title Change DefaultSslEngineFactory to lazily build its SslContext Allow DefaultSslEngineFactory subclass customization of the SslContext Jun 2, 2016
@marshallpierce
Copy link
Author

marshallpierce commented Jun 2, 2016

btw, I can't run tests on my Linux box - BasicHttpProxyToHttpTest.nonPreemptyProxyAuthWithPlainHttpTarget consistently hangs. It works on my OS X laptop. Is this a known issue? It happens in untouched upstream as well as my fork.

Specifically, it looks like it makes a request to the proxy, which then requires authorization, so it makes another request, and the proxy tries to forward, and nothing happens:

2016-06-02 08:55:49,088 [qtp1397616978-24] DEBUG org.asynchttpclient.BasicHttpProxyToHttpTest - >>> got a request !
2016-06-02 08:55:49,088 [qtp1397616978-24] DEBUG org.asynchttpclient.BasicHttpProxyToHttpTest$BasicAuthProxyServlet.2d9d4f9d - 374875871 rewriting: http://localhost:41740/foo/bar -> http://localhost:41740/foo/bar
2016-06-02 08:55:49,094 [qtp1397616978-24] DEBUG org.asynchttpclient.BasicHttpProxyToHttpTest$BasicAuthProxyServlet.2d9d4f9d - 374875871 proxying to upstream:
GET /foo/bar HTTP/1.1
User-Agent: AHC/2.0
Host: localhost:41740
Accept: */*
Proxy-Authorization: Basic am9obmRvZTpwYXNz

HttpRequest[GET /foo/bar HTTP/1.1]@1dba6237

... and eventually timeout.

Watching in Wireshark, nothing is ever sent to the upstream server. I see two GETs to the proxy but nothing to the upstream server.

@slandelle
Copy link
Contributor

BasicHttpProxyToHttpTest.nonPreemptyProxyAuthWithPlainHttpTarget consistently hangs. It works on my OS X laptop. Is this a known issue? It happens in untouched upstream as well as my fork.

Oh my... I run on OSX and was trusting Travis to execute the tests on Linux, but I just realized that Travis skip tests by default!!! Investigating.

@slandelle
Copy link
Contributor

Actually, there was an issue with the Travis set up, tests are currently running: https://travis-ci.org/AsyncHttpClient/async-http-client/builds/134807126

@marshallpierce
Copy link
Author

FWIW I get the following on a full run:


Failed tests:
  BasicHttpProxyToHttpTest.nonPreemptyProxyAuthWithPlainHttpTarget:119 expected [200] but found [504]
  ProxyTest.testRequestNonProxyHost:176 » IllegalState failed to create a child ...
  ReactiveStreamsTest.testConnectionDoesNotGetClosed:66 arrays differ firstly at element [12000]; expected value is <111> but was <45>.

The last one only shows up sometimes. The second one shows up every time I do a full run, but doesn't fail when run by itself.

@slandelle
Copy link
Contributor

How do you run maven? By any chance, do you use parallel builds?

@marshallpierce
Copy link
Author

mvn clean install. No settings.xml. No parallel builds, version 3.3.9.

@marshallpierce
Copy link
Author

I'll see if I can reproduce this in a VM.

@slandelle
Copy link
Contributor

I'll investigate tomorrow at the office too. Those issues are very weird. The IllegalStateException usually means connectivity issues or ports starvation.

@marshallpierce
Copy link
Author

In case it comes in handy, I pushed https://github.com/marshallpierce/async-http-client/tree/debug-test-failures-vagrant which has a vagrant box. Curiously, I get different issues in the VM:

Running TestSuite
Tests run: 395, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 134.372 sec <<< FAILURE! - in TestSuite
testListenableFutureAfterCompletion(org.asynchttpclient.ListenableFutureTest)  Time elapsed: 0.028 sec  <<< FAILURE!
java.lang.AssertionError: expected [1] but found [0]
        at org.testng.Assert.fail(Assert.java:94)
        at org.testng.Assert.failNotEquals(Assert.java:496)
        at org.testng.Assert.assertEquals(Assert.java:125)
        at org.testng.Assert.assertEquals(Assert.java:372)
        at org.testng.Assert.assertEquals(Assert.java:382)
        at org.asynchttpclient.ListenableFutureTest.testListenableFutureAfterCompletion(ListenableFutureTest.java:63)


@slandelle
Copy link
Contributor

slandelle commented Jun 3, 2016

WAT?!

I only see 2 changes that could be related:

  • I modified AbstractListenableFuture to use the same design as your PR for safe publication, see 2addbc0 (you don't have this commit in your fork).
  • I updated our internal copy of ExecutionList, which comes from Guava, in order to benefit from this commit, see f4f6129

Could you try adding the former AND/OR reverting the latter, please?

I'll try to reproduce on my side today.

@slandelle
Copy link
Contributor

I could reproduce the 504 on Travis: https://travis-ci.org/AsyncHttpClient/async-http-client#L5629

@slandelle
Copy link
Contributor

I could get the Travis build to pass if I ask surefire to not fork and execute tests sequentially, see 8c054a3 and https://travis-ci.org/AsyncHttpClient/async-http-client/builds/134987736.

And I can't reproduce the issue on my local Linux box. Will try your vagrant.

@slandelle
Copy link
Contributor

I can't reproduce on VirtualBox either :(

And actually build failed again on Travis... I'm at a loss.
What's your host config?

@slandelle
Copy link
Contributor

slandelle commented Jun 3, 2016

ProxyTest.testRequestNonProxyHost:176 » IllegalState failed to create a child ...
shows up every time I do a full run, but doesn't fail when run by itself.

I would be awesome if you could get the full stacktrace. In order to get surefire to output logs, you have to enable a profile: mvn test -Ptest-output.

This message comes from Netty, and I think the root issue is there, like lacking/leaking native memory. I really need the full stacktrace with the cause Exception.

@slandelle
Copy link
Contributor

Pinned it: #1172

@marshallpierce
Copy link
Author

I'll dig in once I'm at the office (about an hour). The host is an E5-1650 v3 w/ 64GiB, running up-to-date Gentoo Linux with Oracle JDK 1.8u92. It's not running with SELinux or restrictive sysctl params or anything like that.

@slandelle
Copy link
Contributor

Check last comment: issue solved :)
8951a89#diff-09326dc85cb5946ac041536f7475c385R90

@slandelle
Copy link
Contributor

(cont)

At least the BasicHttpProxyToHttpTest.
I couldn't figure out exactly where the lock was as I could only reproduce on Travis, where I can't of course make a thread dump. But tests are all green on Travis now.

I couldn't reproduce the other ones.

@marshallpierce
Copy link
Author

On 9f817eb I get:

Failed tests: 
  RemoteSiteTest.stripQueryStringTest:164 expected [200] but found [503]
  ProxyTest.testRequestNonProxyHost:176 » IllegalState failed to create a child ...

Stack traces:

 testRequestNonProxyHost
java.lang.IllegalStateException: failed to create a child event loop
    at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:68)
    at io.netty.channel.MultithreadEventLoopGroup.<init>(MultithreadEventLoopGroup.java:49)
    at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:68)
    at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:63)
    at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:54)
    at org.asynchttpclient.netty.channel.ChannelManager.<init>(ChannelManager.java:173)
    at org.asynchttpclient.DefaultAsyncHttpClient.<init>(DefaultAsyncHttpClient.java:81)
    at org.asynchttpclient.DefaultAsyncHttpClient.<init>(DefaultAsyncHttpClient.java:63)
    at org.asynchttpclient.Dsl.asyncHttpClient(Dsl.java:25)
    at org.asynchttpclient.proxy.ProxyTest.testRequestNonProxyHost(ProxyTest.java:176)
Caused by: io.netty.channel.ChannelException: failed to open a new selector
    at io.netty.channel.nio.NioEventLoop.openSelector(NioEventLoop.java:143)
    at io.netty.channel.nio.NioEventLoop.<init>(NioEventLoop.java:134)
    at io.netty.channel.nio.NioEventLoopGroup.newChild(NioEventLoopGroup.java:94)
    at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:64)
    ... 37 more
Caused by: java.io.IOException: Too many open files
    at sun.nio.ch.IOUtil.makePipe(Native Method)
    at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:65)
    at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36)
    at io.netty.channel.nio.NioEventLoop.openSelector(NioEventLoop.java:141)
    ... 40 more
... Removed 28 stack frames

and

 stripQueryStringTest
java.lang.AssertionError: expected [200] but found [503]
    at org.asynchttpclient.RemoteSiteTest.stripQueryStringTest(RemoteSiteTest.java:164)
... Removed 33 stack frames

The first one I cannot reproduce when running by itself. The latter one I did on the third try. However, that one appears to be hitting the live freakonomics.com site, so it's probably unavoidable without reimplementing using a local http server.

ulimit -n reports 1024 on my Linux box, which is pretty typical. I can certainly make it a massive number to make the tests pass, but the fact that it fails in an all-tests run and passes by itself makes your resource leak theory seem pretty likely. Some casual lsofing while running tests does show a lot of pipes allocated. Annoyingly, setting ulimit -n 1024 on OS X doesn't make the build fail.

@marshallpierce
Copy link
Author

Oh, apparently the hard limit on OS X is unlimited anyway.

Re-running on linux with ulimit -n 4096 still fails on testRequestNonProxyHost. lsofing for pipes owned by java showed counts as high as 36000 during the test run, and was typically above 10000, so clearly something is unhappy.

@slandelle
Copy link
Contributor

The first one I cannot reproduce when running by itself. The latter one I did on the third try. However, that one appears to be hitting the live freakonomics.com site, so it's probably unavoidable without reimplementing using a local http server.

Yeah, some tests where originally introduced this way (before I joined here). There are some issues open for refactoring them into proper standalone one, such as #496. Help welcome. The "freakonomics.com" website is sometimes unstable.

ulimit -n reports 1024 on my Linux box, which is pretty typical. I can certainly make it a massive number to make the tests pass, but the fact that it fails in an all-tests run and passes by itself makes your resource leak theory seem pretty likely.

Maybe, but not necessarily. It could be that we have tons of ports that have been opened, but can't be recycled yet.
Default ulimit (1024) is a value suited for desktop, not server.

Some casual lsofing while running tests does show a lot of pipes allocated.

Sure:

  • every test class starts one or two embedded Jetty (hence server ports)
  • every test method start one client (hence one client port per connection)

@slandelle
Copy link
Contributor

lsof accounts for all open files, not only sockets. What does netstat says?

@marshallpierce
Copy link
Author

Is there anything in particular you'd like to look at in netstat output? I don't see anything other than a hundred TIME_WAIT sockets, which since they've already been closed are not associated with the opening process. In my lsof'ing I'm grepping for this (with all other java processes closed):

lsof | grep '^java' | grep pipe | wc -l

Or this:

lsof | grep '^java' | grep eventpoll | wc -l

I can't think of a good reason why that should output 121256 eventpoll fds, or 85741 pipes, and yet I've seen those numbers. Even if the sockets were persisting after being properly closed, they shouldn't be tied to that pid anymore.

@marshallpierce
Copy link
Author

I also found one of these in the test log:

2016-06-03 08:52:35,827 [Finalizer] ERROR org.asynchttpclient.DefaultAsyncHttpClient - AsyncHttpClient.close() hasn't been invoked, which may produce file descriptor leaks

So, we've got at least one leak.

@marshallpierce
Copy link
Author

I stuck a Thread.sleep(Long.MAX_VALUE) at the top of testRequestNonProxyHost to give me some time to investigate. After dumping the (live only) heap with jmap, some cursory jhat tells me that there are 800 live instances of sun.nio.ch.EPollSelectorImpl, the class whose ctor fails with the ISE. That seems wrong for a test.

When I run just that test alone and take a live heap dump while it's sleeping at the beginning of the test, I see 8 instances, so it seems likely that the other 792 are leaked from previous tests.

@slandelle
Copy link
Contributor

2016-06-03 08:52:35,827 [Finalizer] ERROR org.asynchttpclient.DefaultAsyncHttpClient - AsyncHttpClient.close() hasn't been invoked, which may produce file descriptor leaks

This was probably a derived client no being closed: f402385#diff-8bec36202922bc4718c8cabbdf131ea0L276.
False positive.

I did found tons of clients not being closed but in an extra module that's build downstream.

@marshallpierce
Copy link
Author

I've rebased this on top of latest master. Also, if travis will let you, would be interesting to add 'ulimit -H -n 4000' so we have a known config there. The build passes with 4000 for me. (The cutoff appears to be between 3500 and 4000.)

@slandelle
Copy link
Contributor

Done

@marshallpierce
Copy link
Author

Is there anything else you'd like me to improve on this, or are you happy with the current approach?

public DefaultSslEngineFactory(AsyncHttpClientConfig config) throws SSLException {
this.sslContext = getSslContext(config);
}
public DefaultSslEngineFactory() { }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove explicit default constructor.

@slandelle
Copy link
Contributor

Could you please address last comment and squash?

…makes it feasible provide a subclass of DSEF via config that can customize the SslContext as it is built. Let SSLContext errors be detected early by adding init() to SslEngineFactory.
@slandelle slandelle added this to the 2.0.5 milestone Jun 7, 2016
@slandelle slandelle merged commit 6627c53 into AsyncHttpClient:master Jun 7, 2016
@slandelle
Copy link
Contributor

Thanks!

@slandelle
Copy link
Contributor

slandelle commented Jun 7, 2016

@marshallpierce I've just released 2.0.5

@marshallpierce
Copy link
Author

2.0.5 is working great with a custom subclass of DSEF; thanks!

@slandelle
Copy link
Contributor

Great! Thanks for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants