Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dnsdist: Implement async processing of queries and responses #12388

Merged
merged 17 commits into from
Jan 24, 2023

Conversation

rgacogne
Copy link
Member

@rgacogne rgacogne commented Jan 6, 2023

Short description

This PR implements the ability to process queries and responses in an asynchronous way, suspending them to continue processing other queries and responses, while we are waiting for an external event to occur.
It also implements the ability to restart a query based on the content of the response, for example to re-try sending the initial query to a different pool of servers if the initial attempts yields a ServFail or Refused response code.

The documentation needs to be updated before merging this PR.

Note that this pull request builds on #12383 and will have to be rebased.

Closes #10690

Checklist

I have:

  • read the CONTRIBUTING.md document
  • compiled this code
  • tested this code
  • included documentation (including possible behaviour changes)
  • documented the code
  • added or modified regression test(s)
  • added or modified unit test(s)

@rgacogne rgacogne added this to the dnsdist-1.8.0 milestone Jan 6, 2023
@rgacogne rgacogne requested review from chbruyand and removed request for chbruyand January 6, 2023 09:44
@rgacogne rgacogne force-pushed the ddist-async branch 2 times, most recently from d72a142 to 80188a2 Compare January 11, 2023 14:05
@rgacogne rgacogne marked this pull request as ready for review January 11, 2023 14:06
@rgacogne rgacogne requested review from chbruyand and omoerbeek and removed request for chbruyand January 11, 2023 14:07
@phonedph1
Copy link
Contributor

I might have missed it, but are dq:setRestartable() and dq:restart() in the docs somewhere? The test case for this looks like it will do some pretty nice stuff - I almost feel like this test case should be copied into the docs somewhere for how simplistic to configure but powerful it is in use.

@rgacogne
Copy link
Member Author

I might have missed it, but are dq:setRestartable() and dq:restart() in the docs somewhere? The test case for this looks like it will do some pretty nice stuff - I almost feel like this test case should be copied into the docs somewhere for how simplistic to configure but powerful it is in use.

You did not miss it, I did not add that part of the documentation yet but I will do that soon!

@rgacogne
Copy link
Member Author

You did not miss it, I did not add that part of the documentation yet but I will do that soon!

Done so now, thanks :-)

@rgacogne
Copy link
Member Author

Rebased to fix conflicts.

@rgacogne
Copy link
Member Author

Rebased to fix (new) conflicts.

Copy link
Member

@omoerbeek omoerbeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Started with reading the docs, notes a few things.

To be continued.

pdns/dnsdistdist/docs/reference/dq.rst Outdated Show resolved Hide resolved
pdns/dnsdistdist/docs/reference/dq.rst Outdated Show resolved Hide resolved
pdns/dnsdistdist/docs/reference/dq.rst Outdated Show resolved Hide resolved
pdns/dnsdistdist/docs/reference/dq.rst Show resolved Hide resolved
Copy link
Member

@omoerbeek omoerbeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some local nits noted. Still have to try to grasp the whole thing.

pdns/dnsdistdist/dnsdist-lua-ffi.cc Outdated Show resolved Hide resolved
@@ -69,6 +73,7 @@ struct DNSQuestion
DNSQuestion(const DNSQuestion&) = delete;
DNSQuestion& operator=(const DNSQuestion&) = delete;
DNSQuestion(DNSQuestion&&) = default;
virtual ~DNSQuestion() = default;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this virtual dt needed? Only once class derives from DNSQuestion and it does not override the dt (and adds no fields either). Though I see it will make dnsdist_ffi_dnsresponse_set_async more complicated, as it wants to distinguish the two one way or another.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually not sure how we would get a DNSResponse pointer from a DNSQuestion pointer otherwise? We could use a union and a boolean in dnsdist_ffi_dnsquestion_t to store the correct type of pointer instead, of course, but it would impact all the existing FFI methods.

pdns/dnsdistdist/test-dnsdistasync.cc Outdated Show resolved Hide resolved
@rgacogne
Copy link
Member Author

I pushed a few commits addressing your remarks, thanks!

@rgacogne rgacogne requested a review from omoerbeek January 20, 2023 15:07
pdns/dnsdistdist/test-dnsdistasync.cc Outdated Show resolved Hide resolved
@rgacogne rgacogne requested a review from omoerbeek January 23, 2023 15:33
@omoerbeek
Copy link
Member

omoerbeek commented Jan 24, 2023

There is still something fishy going on on OpenBSD testrunner (both amd64 and arm64):

 16536/563053  testrunner CALL  nanosleep(0x7f7ffffcad18,0)
 16536/578170  testrunner CALL  kevent(5,0,0,0x488ede80e00,1,0x488c5f69100)
 16536/563053  testrunner STRU  struct timespec { 0.200000000 }
 16536/578170  testrunner STRU  struct timespec { 0.-01000000 }
 16536/578170  testrunner RET   kevent -1 errno 22 Invalid argument

investigating...

Added a check on negatve timeout in KqueueFDMultiplexer::getAvailableFDs:

(gdb) bt
#0  thrkill () at /tmp/-:3
#1  0xa8d5499e3a753aef in ?? ()
#2  0x00000912dcadd25e in _libc_abort () at /usr/src/lib/libc/stdlib/abort.c:51
#3  0x00000910c13b2329 in KqueueFDMultiplexer::getAvailableFDs (this=0x0, fds=..., 
    timeout=0) at kqueuemplexer.cc:145
#4  0x00000910c0f0948b in dnsdist::AsynchronousHolder::wait (data=..., mplexer=..., 
    readyFDs=..., atMostMs=0) at dnsdist-async.cc:88
#5  0x00000910c0f09753 in dnsdist::AsynchronousHolder::mainThread (data=...)
    at dnsdist-async.cc:161
#6  0x00000910c0f0b769 in dnsdist::AsynchronousHolder::AsynchronousHolder(bool)::$_0::operator()() const (this=<optimized out>) at dnsdist-async.cc:52
#7  std::__1::__invoke<dnsdist::AsynchronousHolder::AsynchronousHolder(bool)::$_0>(dnsdist::AsynchronousHolder::AsynchronousHolder(bool)::$_0&&) (__f=...)
    at /usr/include/c++/v1/type_traits:3918
#8  std::__1::__thread_execute<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, dnsdist::AsynchronousHolder::AsynchronousHolder(bool)::$_0>(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, dnsdist::AsynchronousHolder::AsynchronousHolder(bool)::$_0>&, std::__1::__tuple_indices<>) (__t=...)
    at /usr/include/c++/v1/thread:280
#9  std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, dnsdist::AsynchronousHolder::AsynchronousHolder(bool)::$_0> >(void*) (__vp=0x9130bda86c0)
    at /usr/include/c++/v1/thread:291
#10 0x00000912e1e35731 in _rthread_start (v=<optimized out>)
    at /usr/src/lib/librthread/rthread.c:96
#11 0x00000912dcabbc5a in __tfork_thread ()
    at /usr/src/lib/libc/arch/amd64/sys/tfork_thread.S:86

Looks like the default value -1 of timeout in mainThread is used. running with a default value of 1 works better.

BTW earlier I suggested to use timeradd and timersub. But now I see we also have

void normalizeTV(struct timeval& tv)
const struct timeval operator+(const struct timeval& lhs, const struct timeval& rhs)
const struct timeval operator-(const struct timeval& lhs, const struct timeval& rhs)

in misc.cc (with a funny const return values).
e.g. suspendQuery and suspendResponse can use normalizeTV. And the tests can use the operators. Sorry for not spotting these methods earlier.

Copy link
Member

@omoerbeek omoerbeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit and and a bug, described in comments.

pdns/lock.hh Show resolved Hide resolved
@rgacogne
Copy link
Member Author

Looks like the default value -1 of timeout in mainThread is used. running with a default value of 1 works better.

When we have a bigger problem than this PR, because mplexer.hh documents -1 as 'will wait forever', and this is the behaviour I want, not being woken up every second for no reason.. I guess it has always been broken with kqueue, I'll investigate.

@omoerbeek
Copy link
Member

Looks like the default value -1 of timeout in mainThread is used. running with a default value of 1 works better.

When we have a bigger problem than this PR, because mplexer.hh documents -1 as 'will wait forever', and this is the behaviour I want, not being woken up every second for no reason.. I guess it has always been broken with kqueue, I'll investigate.

kqueue expects NULL ts for infinite timeout

@rgacogne
Copy link
Member Author

BTW earlier I suggested to use timeradd and timersub. But now I see we also have

void normalizeTV(struct timeval& tv)
const struct timeval operator+(const struct timeval& lhs, const struct timeval& rhs)
const struct timeval operator-(const struct timeval& lhs, const struct timeval& rhs)

in misc.cc (with a funny const return values). e.g. suspendQuery and suspendResponse can use normalizeTV. And the tests can use the operators. Sorry for not spotting these methods earlier.

Nice, thank you! I pushed that change now.

@omoerbeek
Copy link
Member

Tests OK now on both macOS and OpenBSD.

@rgacogne rgacogne requested a review from omoerbeek January 24, 2023 12:26
Copy link
Member

@omoerbeek omoerbeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot say I checked every detail, but this reads good now and after the latest changes I've run tests on macOS, OpenBSD and debian bullseye successfully

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow falling back to using authoritatives if forwarders respond with servfail
3 participants