Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ACME needs to check revocation for unexpired certificates. #129838

Closed
kevincox opened this issue Jul 10, 2021 · 5 comments
Closed

ACME needs to check revocation for unexpired certificates. #129838

kevincox opened this issue Jul 10, 2021 · 5 comments

Comments

@kevincox
Copy link
Contributor

kevincox commented Jul 10, 2021

Describe the bug
Right now the acme service never contacts the outside world if the certificate expires more than validMinDays into the future. This behaviour is incorrect as the certificate may be revoked, or the CA may plan to revoke the certificate at any point before the actual expiry date. Therefore the CA must be contacted regularly so that the certificate can be renewed if required. This ensures that the certificate can be refreshed without downtime for a planned revocation or for minimal downtime upon unplanned revocation.

Additional context

https://github.com/NixOS/nixpkgs/blob/nixos-unstable/nixos/modules/security/acme.nix#L283-305

Planned revocation isn't yet specced out or implemented. But checking for revoked certs should be done.
https://mailarchive.ietf.org/arch/msg/acme/b-RddSX8TdGYvO3f9c7Lzg6I2I4/#

Notify maintainers
@roberth
@aanderse @andrew-d @arianvp @Emily @flokli @m1cr0man

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute:
# a list of nixos modules affected by the problem
module:
- security.acme
@kevincox kevincox changed the title ACME needs to contact the CA even for unexpired certificates. ACME needs to check revocation for unexpired certificates. Jul 10, 2021
@roberth
Copy link
Member

roberth commented Jul 10, 2021

This behavior was introduced in #114752 to work around network problems during or perhaps shortly after NixOS activation, causing deployment failures for no good reason.
To resolve this issue, it should try to contact the CA but ignore network-related errors.
Note that we are not supposed to retry, to avoid overloading the CA when it is under load. However, retries will happen as part of base load thanks to the timer unit.

Ideally we improve the activation (switch) logic but this is a hard problem, as pointed out by @arianvp #85794 (comment).

@kevincox
Copy link
Contributor Author

That sounds reasonable enough to me. It seems that it ideally would be some sort of failure still, but I guess you would get an email from the CA anyways so silently failing isn't a major issue. I agree that it shouldn't block deploys so this is probably the best we can do at the moment.

For now I think that checking an OSCP response if available every 24h is fine. We can revisit and use the CA-recommended values if the ACME Renewal Information (ARI) API goes anywhere.

@m1cr0man
Copy link
Contributor

Ideally, we don't want to be implementing API calls to CAs ourselves in the acme renew service script. It is complicated enough as is, and finding ways to solve issues like these without regressing continues to be a difficult task. At some point, we should probably start opening PRs to lego to get features we need implemented (Something I'm considering doing myself, at least).

For now, we can probably leave checking OCSP to lego (this would explain why it even needs an internet connection when certs are not expired), but we would still need to manually check before hand if the internet connection is working and the cert is not expired. Relying on systemd service ordering is not sufficient as explained in #85794 (comment) Here's a little truth table I've come up with:

Internet Cert Expired Run lego Triggers failure
No No No No
Yes No Yes No
No Yes Yes Yes
Yes Yes Yes No

As you can see, there's only one scenario here where we want to fail the service. We should continue to ignore OCSP if there is no active internet connection to avoid rebuild bugs as @roberth explained. However if the cert is expired we should run lego anyway so that users are informed of the issue.

@kevincox
Copy link
Contributor Author

I wonder if we could call Lego, and ignore failures if the cert still has expiry time left. That isn't optimal but maybe is good enough? It may be worth reaching out to the Lego devs and seeing if there is some sort of different error code for transient conditions like this.

I'll also look more into the service ordering. It seems like there should be a way to strongly bind to the DNS service. But maybe this isn't possible through a target and it would still result in the acme service being killed which would need to be handled somewhat gracefully.

@m1cr0man
Copy link
Contributor

m1cr0man commented Jul 10, 2021

I wonder if we could call Lego, and ignore failures if the cert still has expiry time left. That isn't optimal but maybe is good enough? It may be worth reaching out to the Lego devs and seeing if there is some sort of different error code for transient conditions like this.

Honestly I'm quite a fan of this idea! Why worry about cert failures pre-validMinDays? But yes it would be nice if lego could do this itself (via a flag) or returned a different error code.

I'll also look more into the service ordering. It seems like there should be a way to strongly bind to the DNS service. But maybe this isn't possible through a target and it would still result in the acme service being killed which would need to be handled somewhat gracefully.

If you find a solution to that one please update #106862 :) my last comment there has some thoughts on the matter. I did explore the dependency options quite thoroughly at the time but couldn't find a good match for our requirements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants