Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dnsdist: Remove unreachable code in HTTP/2 connections cleanup #10920

Merged
merged 10 commits into from
Nov 15, 2021

Conversation

rgacogne
Copy link
Member

Short description

Reported by Coverity (CID 373724).

It also immediately removes a HTTP/2 connection from the list if it is no longer usable (TCP socket is dead).

Checklist

I have:

  • read the CONTRIBUTING.md document
  • compiled this code
  • tested this code
  • included documentation (including possible behaviour changes)
  • documented the code
  • added or modified regression test(s)
  • added or modified unit test(s)

@rgacogne
Copy link
Member Author

I fixed the two reported problems, but this calls for a refactoring and some tests, perhaps in a separate PR.

@rgacogne
Copy link
Member Author

rgacogne commented Nov 3, 2021

Refactoring done, and unit tests added (fixing a bug in the process, as always). Also fixed a race in the AXFR regression tests.

omoerbeek
omoerbeek previously approved these changes Nov 5, 2021
@rgacogne
Copy link
Member Author

rgacogne commented Nov 5, 2021

I think I have spotted an issue in the way "idle" connections are expunged when the limit is reached, it looks like 1/ we count all connections, not only the idle ones 2/ we can expunge a non-idle connection.

@omoerbeek
Copy link
Member

omoerbeek commented Nov 5, 2021

I think I have spotted an issue in the way "idle" connections are expunged when the limit is reached, it looks like 1/ we count all connections, not only the idle ones 2/ we can expunge a non-idle connection.

I missed that. I suppose you are talking about the last part of getConnectionToDownstream() in dnsdist-tcp-downstream.hh. As for the counting, shouldn't idle connection be dropped by the call to cleanupClosedConnections() before? In that case the max is real and no new connection should be allowed I think. Which means the check should be before the creation of newConnection.
But I don't think the callers of getConnectionToDownstream() are prepared for a failure case...

@omoerbeek omoerbeek self-requested a review November 5, 2021 11:45
@omoerbeek omoerbeek dismissed their stale review November 5, 2021 11:46

Missed maxCachedConnectionsPerDownstream case and it's not clear to me how that should be fixed

@rgacogne
Copy link
Member Author

rgacogne commented Nov 5, 2021

The problem comes from the fact that before 1.7.0, outgoing TCP connections were only added to the list when they were idle, since the same connection was never used for more than one incoming client. But in 1.7.0 both DoH and DoTCP53/DoT connections can be used for several incoming clients, and are therefore always added to that list. So the setting limiting the number of idle connections (I don't it would make sense to limit the number of active connections) should no longer apply to all the connections in there, only the idle ones.
Perhaps we should have two lists, then..

@rgacogne
Copy link
Member Author

rgacogne commented Nov 5, 2021

I pushed a new commit splitting the list in two: one for active connections, and one for idle ones. That way it's easy to count the number of idle connections and to expunge one of them when the limit is reached. It also makes it easier to try the idle connections first when we need a new one, since we are more likely to be able to write to it right away, instead of having to wait for a pending one to be sent when reusing an active connection.

@rgacogne
Copy link
Member Author

rgacogne commented Nov 5, 2021

Needs a serious review, though. I'll do more testing on Monday.

@rgacogne
Copy link
Member Author

rgacogne commented Nov 8, 2021

Rebased due to a conflict. Added a new unit test as well.

@omoerbeek
Copy link
Member

omoerbeek commented Nov 10, 2021

Seeing this on OpenBSD about 1 in 3 runs:

[otto@lou:172]$ ./testrunner                                                                                         
Running 134 test cases...
test-dnsdisttcp_cc.cc(217): fatal error: in "test_dnsdisttcp_cc/test_IncomingConnection_BackendNoOOOR": critical check step.request == !d_client ? ExpectedStep::ExpectedRequest::closeClient : ExpectedStep::ExpectedRequest::closeBackend has failed [connect to the backend != close connection to backend]
libc++abi: terminating with uncaught exception of type boost::execution_aborted
unknown location(0): fatal error: in "test_dnsdisttcp_cc/test_IncomingConnection_BackendNoOOOR": signal: generated by kill() (or family); uid=0; pid=0
test-dnsdisttcp_cc.cc(217): last checkpoint

*** 2 failures are detected in the test module "unit"

and also a (single) case of:

test-dnsdistnghttp2_cc.cc(405): fatal error: in "test_dnsdistnghttp2_cc/test_SingleQuery": critical check step.request == !d_client ? ExpectedStep::ExpectedRequest::closeClient : ExpectedStep::ExpectedRequest::closeBackend has failed [connect to the backend != close connection to backend]
libc++abi: terminating with uncaught exception of type boost::execution_aborted
unknown location(0): fatal error: in "test_dnsdistnghttp2_cc/test_SingleQuery": signal: generated by kill() (or family); uid=0; pid=0
test-dnsdistnghttp2_cc.cc(405): last checkpoint

@rgacogne
Copy link
Member Author

So both times we decide to close a backend connection while we expected to open one, apparently. Would you be able to re-run these tests after recompiling with the #if 0 line in tcpiohandler-mplexer.hh set to #if 1, by any chance?

@omoerbeek
Copy link
Member

Here you are:
log.txt.gz

@omoerbeek
Copy link
Member

I ran the unittests on the master branch (still OpenBSD) and I'm seeing the same failures. So this is not a problem introduced by this PR.

@rgacogne
Copy link
Member Author

Thanks! Still I would like to figure out what is happening, I'm running the tests in a loop on Linux and no error so far..
I'll prepare a PR adding more debug lines..

Copy link
Member

@omoerbeek omoerbeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far only small stuff seen. Will continue review Monday.

pdns/dnsdistdist/dnsdist-tcp-downstream.hh Outdated Show resolved Hide resolved
pdns/dnsdistdist/dnsdist-tcp-downstream.hh Show resolved Hide resolved
pdns/dnsdistdist/dnsdist-tcp-downstream.hh Outdated Show resolved Hide resolved
pdns/dnsdistdist/dnsdist-tcp-downstream.hh Outdated Show resolved Hide resolved
Copy link
Member

@omoerbeek omoerbeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really nice piece of code. Apart from some nits I did not spot an issue.

@rgacogne rgacogne merged commit 26d5575 into PowerDNS:master Nov 15, 2021
@rgacogne rgacogne deleted the ddist-cleanup-conns branch November 15, 2021 13:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants