Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dnsdist 1.4.0: LetsEncrypt's certbot fails to insert/delete records #8021

Closed
RvdE opened this issue Jul 3, 2019 · 4 comments
Closed

dnsdist 1.4.0: LetsEncrypt's certbot fails to insert/delete records #8021

RvdE opened this issue Jul 3, 2019 · 4 comments

Comments

@RvdE
Copy link
Contributor

@RvdE RvdE commented Jul 3, 2019

  • Program: dnsdist
  • Issue type: Bug report

Short description

When updating certificates with certbot's dns-rfc2136 plugin dnsdist 1.4.0 fails most of the time.
1.3.3 doesn't appear to suffer from that issue.

Environment

  • Operating system: FreeBSD 12.0-RELEASE-p7
  • Software version: 1.4.0-master (be4ad9f)
  • Software source: compiled myself from git clone

Steps to reproduce

  1. certbot certonly -d ns1.cainites.net -d $RANDOM.cainites.net --dns-rfc2136 --dry-run --dns-rfc2136-credentials /usr/local/etc/letsencrypt/dns-rfc2136.ini

Expected behaviour

certbot certonly -d ns1.cainites.net -d $RANDOM.cainites.net --dns-rfc2136 --dry-run --dns-rfc2136-credentials /usr/local/etc/letsencrypt/dns-rfc2136.ini
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator dns-rfc2136, Installer None


You have an existing certificate that contains a portion of the domains you
requested (ref: /usr/local/etc/letsencrypt/renewal/ns1.cainites.net.conf)

It contains these names: ns1.cainites.net

You requested these names for the new certificate: ns1.cainites.net,
15560.cainites.net.

Do you want to expand and replace this existing certificate with the new
certificate?


(E)xpand/(C)ancel: E
Renewing an existing certificate
Performing the following challenges:
dns-01 challenge for 15560.cainites.net
Waiting 60 seconds for DNS changes to propagate
Waiting for verification...
Cleaning up challenges
Dry run: skipping deploy hook command: /usr/local/etc/letsencrypt/renewal-hooks/deploy/deploy.sh

IMPORTANT NOTES:

  • The dry run was successful.

Actual behaviour

certbot fails with an exception (EOF reading socket) either during the insertion or deletion of the challenge record

This is from running dnsdist on the foreground with the -v flag:
DNSDist 1.4.0-master (20190703):
==========~=
Got TCP connection from 127.0.0.1:32323
TCP connecting to downstream 127.0.0.1:5300 (0)
Got an exception while handling (writing to backend) TCP query from 127.0.0.1:32323: Error while reading message: Socket is not connected
TCP connecting to downstream 127.0.0.1:5300 (1)
Got query for ns1.cainites.net.|AAAA from 127.0.0.1:31815, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:31815, took 105.059 usec
Got query for ns1.cainites.net.|A from 127.0.0.1:10689, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:10689, took 70.455 usec
Got query for ns21.cloudns.net.|AAAA from 127.0.0.1:54477, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:54477, took 76.936 usec
Got query for ns21.cloudns.net.|A from 127.0.0.1:11778, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:11778, took 69.503 usec
Got query for ns22.cloudns.net.|AAAA from 127.0.0.1:19813, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:19813, took 69.438 usec
Got query for ns22.cloudns.net.|A from 127.0.0.1:22044, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:22044, took 61.719 usec
Got query for ns23.cloudns.net.|AAAA from 127.0.0.1:53025, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:53025, took 64.701 usec
Got query for ns23.cloudns.net.|A from 127.0.0.1:32718, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:32718, took 67.057 usec
Got query for ns24.cloudns.net.|AAAA from 127.0.0.1:13959, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:13959, took 73.791 usec
Got query for ns24.cloudns.net.|A from 127.0.0.1:63798, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:63798, took 65.28 usec
Got query for pns21.cloudns.net.|AAAA from 127.0.0.1:56366, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:56366, took 61.068 usec
Got query for pns21.cloudns.net.|A from 127.0.0.1:43639, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:43639, took 69.211 usec
Got query for pns22.cloudns.net.|AAAA from 127.0.0.1:57416, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:57416, took 73.394 usec
Got query for pns22.cloudns.net.|A from 127.0.0.1:38658, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:38658, took 63.199 usec
Got query for pns23.cloudns.net.|AAAA from 127.0.0.1:53442, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:53442, took 71.172 usec
Got query for pns23.cloudns.net.|A from 127.0.0.1:36906, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:36906, took 67.019 usec
Got query for pns24.cloudns.net.|AAAA from 127.0.0.1:20161, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:20161, took 67.241 usec
Got query for pns24.cloudns.net.|A from 127.0.0.1:25955, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:25955, took 60.841 usec
Closing TCP client connection with 127.0.0.1:32323
[..]
Got TCP connection from 127.0.0.1:18068
TCP connecting to downstream 127.0.0.1:5300 (0)
Got an exception while handling (writing to backend) TCP query from 127.0.0.1:18068: Error while reading message: Socket is not connected
TCP connecting to downstream 127.0.0.1:5300 (1)
Got query for ns1.cainites.net.|AAAA from 127.0.0.1:18295, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:18295, took 107.807 usec
Got query for ns1.cainites.net.|A from 127.0.0.1:41847, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:41847, took 81.001 usec
Got query for ns21.cloudns.net.|AAAA from 127.0.0.1:55773, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:55773, took 74.94 usec
Got query for ns21.cloudns.net.|A from 127.0.0.1:17868, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:17868, took 77.599 usec
Got query for ns22.cloudns.net.|AAAA from 127.0.0.1:58035, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:58035, took 70.885 usec
Got query for ns22.cloudns.net.|A from 127.0.0.1:54827, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:54827, took 77.941 usec
Got query for ns23.cloudns.net.|AAAA from 127.0.0.1:28186, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:28186, took 71.678 usec
Got query for ns23.cloudns.net.|A from 127.0.0.1:27815, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:27815, took 75.778 usec
Got query for ns24.cloudns.net.|AAAA from 127.0.0.1:52568, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:52568, took 71.4 usec
Got query for ns24.cloudns.net.|A from 127.0.0.1:39193, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:39193, took 77.399 usec
Got query for pns21.cloudns.net.|AAAA from 127.0.0.1:60810, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:60810, took 69.656 usec
Got query for pns21.cloudns.net.|A from 127.0.0.1:58300, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:58300, took 86.814 usec
Got query for pns22.cloudns.net.|AAAA from 127.0.0.1:22389, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:22389, took 70.531 usec
Got query for pns22.cloudns.net.|A from 127.0.0.1:60016, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:60016, took 75.397 usec
Got query for pns23.cloudns.net.|AAAA from 127.0.0.1:58081, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:58081, took 80.809 usec
Got query for pns23.cloudns.net.|A from 127.0.0.1:21071, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:21071, took 75.934 usec
Got query for pns24.cloudns.net.|AAAA from 127.0.0.1:60717, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:60717, took 69.05 usec
Got query for pns24.cloudns.net.|A from 127.0.0.1:24474, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:24474, took 76.035 usec
Closing TCP client connection with 127.0.0.1:18068

DNSDist 1.3.3:
====~
Got TCP connection from 127.0.0.1:52119
TCP connecting to downstream 127.0.0.1:5300 (0)
Got query for ns1.cainites.net.|AAAA from 127.0.0.1:19777, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:19777, took 102.211 usec
Got query for ns1.cainites.net.|A from 127.0.0.1:39181, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:39181, took 62.419 usec
Got query for ns21.cloudns.net.|AAAA from 127.0.0.1:49778, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:49778, took 73.866 usec
Got query for ns21.cloudns.net.|A from 127.0.0.1:32694, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:32694, took 65.213 usec
Got query for ns22.cloudns.net.|AAAA from 127.0.0.1:11961, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:11961, took 70.385 usec
Got query for ns22.cloudns.net.|A from 127.0.0.1:64406, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:64406, took 66.162 usec
Got query for ns23.cloudns.net.|AAAA from 127.0.0.1:22792, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:22792, took 66.027 usec
Got query for ns23.cloudns.net.|A from 127.0.0.1:45971, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:45971, took 59.044 usec
Got query for ns24.cloudns.net.|AAAA from 127.0.0.1:15728, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:15728, took 72.565 usec
Got query for ns24.cloudns.net.|A from 127.0.0.1:54837, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:54837, took 61.885 usec
Got query for pns21.cloudns.net.|AAAA from 127.0.0.1:23801, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:23801, took 74.133 usec
Got query for pns21.cloudns.net.|A from 127.0.0.1:27594, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:27594, took 71.445 usec
Got query for pns22.cloudns.net.|AAAA from 127.0.0.1:10776, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:10776, took 75.207 usec
Got query for pns22.cloudns.net.|A from 127.0.0.1:27766, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:27766, took 64.979 usec
Got query for pns23.cloudns.net.|AAAA from 127.0.0.1:54053, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:54053, took 70.983 usec
Got query for pns23.cloudns.net.|A from 127.0.0.1:14917, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:14917, took 70.164 usec
Got query for pns24.cloudns.net.|AAAA from 127.0.0.1:42298, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:42298, took 72.044 usec
Got query for pns24.cloudns.net.|A from 127.0.0.1:64711, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:64711, took 62.603 usec
Closing TCP client connection with 127.0.0.1:52119
[..]
Got TCP connection from 127.0.0.1:35337
TCP connecting to downstream 127.0.0.1:5300 (0)
Got query for ns1.cainites.net.|AAAA from 127.0.0.1:12219, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:12219, took 105.655 usec
Got query for ns1.cainites.net.|A from 127.0.0.1:53563, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:53563, took 64.825 usec
Got query for ns21.cloudns.net.|AAAA from 127.0.0.1:12250, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:12250, took 149.566 usec
Got query for ns21.cloudns.net.|A from 127.0.0.1:24992, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:24992, took 62.642 usec
Got query for ns22.cloudns.net.|AAAA from 127.0.0.1:27388, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:27388, took 69.855 usec
Got query for ns22.cloudns.net.|A from 127.0.0.1:31887, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:31887, took 70.517 usec
Got query for ns23.cloudns.net.|AAAA from 127.0.0.1:51832, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:51832, took 71.786 usec
Got query for ns23.cloudns.net.|A from 127.0.0.1:21974, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:21974, took 70.253 usec
Got query for ns24.cloudns.net.|AAAA from 127.0.0.1:47102, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:47102, took 72.278 usec
Got query for ns24.cloudns.net.|A from 127.0.0.1:34348, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:34348, took 62.43 usec
Got query for pns21.cloudns.net.|AAAA from 127.0.0.1:22683, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:22683, took 72.801 usec
Got query for pns21.cloudns.net.|A from 127.0.0.1:40313, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:40313, took 62.702 usec
Got query for pns22.cloudns.net.|AAAA from 127.0.0.1:44284, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:44284, took 73.505 usec
Got query for pns22.cloudns.net.|A from 127.0.0.1:27557, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:27557, took 66.494 usec
Got query for pns23.cloudns.net.|AAAA from 127.0.0.1:14161, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:14161, took 73.453 usec
Got query for pns23.cloudns.net.|A from 127.0.0.1:63576, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:63576, took 63.008 usec
Got query for pns24.cloudns.net.|AAAA from 127.0.0.1:14782, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:14782, took 71.751 usec
Got query for pns24.cloudns.net.|A from 127.0.0.1:63529, relayed to 127.0.0.1:5301
Got answer from 127.0.0.1:5301, relayed to 127.0.0.1:63529, took 62.112 usec
Closing TCP client connection with 127.0.0.1:35337

Other information

Usecase

Description

@rgacogne
Copy link
Member

@rgacogne rgacogne commented Jul 4, 2019

Thank you for this report! Would you mind providing your configuration? I'd like to know whether you have TCP Fast Open toward the backend enabled, for example.

TCP connecting to downstream 127.0.0.1:5300 (0)
Got an exception while handling (writing to backend) TCP query from 127.0.0.1:32323: Error while reading message: Socket is not connected
TCP connecting to downstream 127.0.0.1:5300 (1)

So this indicates that the first connection to the downstream server failed, but the second one apparently succeeded, so I'm afraid I have no clue about what went wrong based on this log :'( Would you be able to look at grepq to see if the query succeeded?
I'm a bit surprised by the error during the failed connection attempt, Socket is not connected means that read() returned ENOTCONN but I don't see how it's possible, any chance you could get a trace with truss at the moment this query is received?

@rgacogne
Copy link
Member

@rgacogne rgacogne commented Jul 5, 2019

After looking at the network traces, it looks like we open a TCP connection to the backend, write to it successfully but then the first call to read() returns ENOTCONN, which we don't handle. I'm a bit surprised that the write() would have succeeded in that case, though, so perhaps I'm missing something. Would you be able to try this 1 patch, by any chance?
I can see from your configuration that you don't enable TCP Fast Open, for the record.

@rgacogne
Copy link
Member

@rgacogne rgacogne commented Aug 9, 2019

I think I can trace this to what I believe is a bug in the FreeBSD kernel:

  • a TCP stream can start in the SS_ISCONNECTING so_state and remain in that state until the handshake has been completed ;
  • it has the PR_IMPLOPCL pr_flag set, meaning that "sendto and sendmsg is allowed" in that state, so sendmsg() will succeed even if the handshake has not been completed yet ;
  • but soreceive_stream() returns ENOTCONN when so_state is different from SS_ISCONNECTED or SS_ISDISCONNECTED, meaning read() will return ENOTCONN if the handshake has not been completed.

In soreceive_generic(), ENOTCONN is not returned if the so_state is SS_ISCONNECTING, so I believe this might be a bug in soreceive_stream(), the "Optimized version of soreceive() for stream (TCP) sockets".
I'll write a PoC to confirm my suspicion and update this comment.

@rgacogne
Copy link
Member

@rgacogne rgacogne commented Aug 9, 2019

Reproduced with net.inet.tcp.soreceive_stream=1 (default is 0) to enable the optimized version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

2 participants