Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

'certify' doesn't update existing certificates #590

Closed
Miserlou opened this issue Jan 10, 2017 · 41 comments
Closed

'certify' doesn't update existing certificates #590

Miserlou opened this issue Jan 10, 2017 · 41 comments

Comments

@Miserlou
Copy link
Owner

Somewhere, a regression was introduced, I think this was from a PR that I was iffy about.

Repro:
Certify on a new name

Calling certify for environment prod..
Certifying domain blah.yourname.io..
Setting DNS challenge..
Waiting for DNS to propagate..
Deleting DNS challenge..
Updating domain name!
Certificate updated!

Go to domain, old cert still in production.

@Miserlou
Copy link
Owner Author

Possibly related:

#458
#334

Could also be a change in APIGW behavior itself.

@chromakey
Copy link

chromakey commented Jan 10, 2017

To +1 this: it seems that none of my certificates have been auto-renewing. I've been having to do it by manually deleting the custom domain name in APIGW and re-deploying/re-certifying. Here's an example zappa_settings.json that I've been using:

{ "production": { "app_function": "flask_server.app", "domain": "www.domain.com", "lets_encrypt_key": "../keys/account.key", "lets_encrypt_expression": "rate(1 day)", "s3_bucket": "proj-prod-zappa", "profile_name": "lambda-deploy", "timeout_seconds": 90 } }

EDIT: Should clarify that while troubleshooting this, I attempted to re-run the certify command with no luck. Running on zappa 0.32.1

@Miserlou
Copy link
Owner Author

I think this is actually related: #391

@kwellman
Copy link
Contributor

+1

This is happening to me too. I noticed that when I run zappa certify the certificate name in api gateway does get updated (e.g. I see the updated timestamped name in the aws console), but the certificate doesn't get updated when I check in my web browser.

I wonder if it's related to this code in the update_domain_name method of zappa.py:

        update_domain_name_response = self.apigateway_client.update_domain_name(
            domainName=domain_name,
            patchOperations=[
                {
                    'op': 'replace',
                    'path': '/certificateName',
                    'value': new_cert_name,
                }
            ]
        )

Is this code just changing the certificate name? Maybe we need to pass the certificate body, etc... to this method. Something like:

      patchOperations=[
                {
                    'op': 'replace',
                    'path': '/certificateName',
                    'value': new_cert_name,
                }
            {"op": "replace",
                "path": "/certificateBody",
                "value": certificate_body},
            {"op": "replace",
                "path": "/certificatePrivateKey",
                "value": certificate_private_key},
            {"op": "replace",
                "path": "/certificateChain",
                "value": certificate_chain},
        ]

...with better formatting of course. I got this code searching Github so it's likely supported. I don't have time to test this at the moment, but within the next week I could try it unless this issue is resolved or someone else can try it.

@kwellman
Copy link
Contributor

Unfortunately, this is not possible like I had hoped. Only the certificate name can be updated (https://docs.aws.amazon.com/apigateway/api-reference/link-relation/domainname-update/). I suppose this problem was known when originally writing the update_domain_name method for Zappa.

I'm not sure how to get around this programmatically, short of creating a new api gateway domain with the new certificate, which may involve some downtime if we have to delete the current api gateway domain first. Hopefully, there is a better way. I will test with an old version of Zappa to determine when this regression occurred, although I always had problems with re-certifying my domain.

Currently, I'm updating my certificate by using the undocumented Zappa features to output the certificate data, and pasting that data into the aws console as described at http://docs.aws.amazon.com/apigateway/latest/developerguide/how-to-custom-domains.html#how-to-rotate-custom-domain-certificate.

P.S: The Terraform project also encountered this problem (hashicorp/terraform#8789)

@chromakey
Copy link

chromakey commented Jan 16, 2017

Yeah I'm not sure how to go about this. Just by virtue of the way APIGW/CF works, any time the certificate is changed on the fly, there's going to be downtime while it's pushed out to the edge nodes in CF.

The AWS docs have something on uploading a backup certificate and then rotating it into production gradually to avoid the downtime issue. I don't know what aspect of this are exposed in the API, but maybe this will give us some clues on where to start looking.

http://docs.aws.amazon.com/apigateway/latest/developerguide/how-to-custom-domains.html#how-to-rotate-custom-domain-certificate

EDIT:
The example here has you update the certificate NAME by specifying the /certificateName path:
http://docs.aws.amazon.com/apigateway/api-reference/link-relation/domainname-update/

I wonder if you could use the same method to update the certificateBody, certificatePrivateKey, and certificateChain parameters by specifying them as paths (i.e. /certificateBody)
http://boto3.readthedocs.io/en/latest/reference/services/apigateway.html#APIGateway.Client.create_domain_name

Just guessing blindly. Worth a shot maybe.

@kwellman
Copy link
Contributor

@chromakey

If only it were that simple...I tried that previously but it didn't work. This is the output I got:

Setting DNS challenge..
Waiting for DNS to propagate..
Deleting DNS challenge..
Updating domain name!
An error occurred (BadRequestException) when calling the UpdateDomainName operation: Invalid patch path  '/certifica
teBody' specified for op 'replace'. Must be one of: [/certificateName]
Failed to generate or install certificate! :(

@chromakey
Copy link

chromakey commented Jan 16, 2017

Ah it was worth a shot. I can't find anything in the documentation that points to the backup certificate functionality being exposed at all. The best we could hope for I think is that maybe it's undocumented and uses parameters like backupCertificateBody or something like that.

Supposedly, AWS has been working on getting ACM to work with API Gateway, but there's no public time table at the moment:
https://forums.aws.amazon.com/thread.jspa?threadID=234686

Maybe we should disable the renewal feature for the time being since from what I can tell it doesn't work as intended and it might catch people off guard when their 90 days is up (speaking from experience) or at least add a warning of sorts.

I'll keep looking as time allows, but unfortunately it doesn't look good at the moment.

EDIT: Well there it is in black and white, I just missed it the first time in the APIGW docs - "You cannot rotate custom domain name certificates programmatically."

@kwellman
Copy link
Contributor

Personally I'm okay with renewing the certificate manually through the AWS console. After all, it's something you only have to do after 90 days. Of course, this should be made very clear in the documentation along with detailed instructions.

One possible option to get the certificate updated programmatically would be to generate a new api gateway domain specifically for the new certificate and then switch the route53 domain to use the new api gateway. Eventually, after a sufficient amount of time for everything to propagate we'd delete the old api gateway. It's complicated but I believe this could all be done automatically in code. Just an idea. Of course, a simpler way would be much preferred.

@aectann
Copy link
Collaborator

aectann commented Jan 24, 2017

@Miserlou Had to renew all our certificates manually today (4 out of 5 expired). So it looks like it was not working properly for a while?

@Miserlou
Copy link
Owner Author

One of the issues was that the documentation was wrong, sorry about that. It's lets_encrypt_expression not lets_encrypt_schedule. There is also a potential race condition that still needs to be addressed.

@aectann
Copy link
Collaborator

aectann commented Jan 24, 2017

@Miserlou hmm.. but that would only affect the creation of a cloudwatch trigger, won't it? I did see certificates expired even with the trigger present.

@aectann
Copy link
Collaborator

aectann commented Jan 24, 2017

Also, I did try to just run certify locally with no effect on the certificate (expired one wasn't replaced).

@Miserlou
Copy link
Owner Author

Yeah, it needs to be chased down. I think there was a PR that introduced this bug although I can't confirm that. I think this is related: #391

@NathanLawrence
Copy link

Welp...this one just came around for me again on a different project. I'm really eager to help out with any solutions on this front. What can we do?

@Miserlou
Copy link
Owner Author

First step is diagnosis!

@NathanLawrence
Copy link

Roger. I'm going to crack down on this one bigtime, so I should have a PR by week's end.

@Miserlou
Copy link
Owner Author

Fantastic news.

@NathanLawrence
Copy link

All right, now that I'm actually getting down to business here, I can't find any way to recreate the issues people are concerned about on a fresh project. It looks to me like when certify is called, the certificate is correctly replaced with a new one, and the ANSI date in its same name seems to confirm this.

Apart from how long it takes for new certs to reach edge nodes, which is not ours to change, what are we looking to improve here?

@Miserlou
Copy link
Owner Author

How long did it take your update to reach "edge nodes"?

@NathanLawrence
Copy link

It hasn't yet, as far as I can tell, and that's a good point -- until it arrives and that expiration date changes, I really won't have any way to tell for certain if we're sending through fresh certs or just changing the name displayed on the AWS UI and nothing else.

@Miserlou
Copy link
Owner Author

Bump - did it propagate?

@NathanLawrence
Copy link

Looks like it!

Period of Validity
Begins on: February 3, 2017
Expires on: May 4, 2017

@pjz
Copy link
Contributor

pjz commented Feb 27, 2017

So now if I have an existing site that didn't update, what do I do?

@NathanLawrence
Copy link

Well nothing this second since Amazon S3 is down, but...

I would update your local tools and try running certify again. If that doesn't fix the problem, it may be time for more drastic measures. You can always undeploy, go through the admin interface to make sure everything is gone, then deploy and certify like it's fresh.

@Miserlou
Copy link
Owner Author

Miserlou commented Mar 3, 2017

I went around this in a giant circle today. Head -> Wall -> bang

New theory is we can remove the API Gateway entry for the domain and recreating it but leave the Route53 domain name in tact, updating the record. Expecting this will cause down time between 0 and 40 minutes. >:[

@chromakey
Copy link

Yeah given how AWS chose to not expose the backup certificate functionality in the API for APIGW, I don't see any other way. Supposedly they're going to integrate ACM with APIGW at some point, but no one knows if that's tomorrow or 10 years from now.

@maxtheman
Copy link

maxtheman commented Mar 3, 2017

I've been writing a little side project (for work) in flask using zappa, so this is my first go-around running zappa certify, and three weird things happened when I ran it which broke my project.

What I did:

  1. First tried to deploy/certify a totally new environment called prod. Certify ran "successfully" and generating the correct (I think) records in the api gateway for custom domain. It did not create any records in route 53. Running certify for the environment again produced issue referenced in How to delete the CloudFront distribution created by certify? #334.

  2. Attempting to access my index url in prod results in a redirect from {{aws_url}}/prod/ => {{aws_url}}/ and a resulting 403 forbidden is thrown, I'm assuming from the Cloudflare distribution (which doesn't exist in my Cloudflare dashboard, but does in the APIGW, oddly enough)

  3. If I go to one of my APIGW endpoints that should be accessible (say, prod/login) I am served the html, but not any of the static files, including the linked css (although the css shows up in dev tools, it's an empty file).

Note that this is the first update to prod I attempted after upgrading zappa to the most recent version (0.37.2) and the only certify I've done.

Fixes I've attempted that didn't work:

  • Undeploy/deploying per @NathanLawrence 's comment above doesn't seem to fix any of this, including issue three.
  • Waiting 1hr+ for the certify to go through

Steps for fix I attempted that did work:

  1. Undeploying, Removing "domain" from zappa_settings.json
  2. Removing the custom domain associated with dev in APIGW
  3. Redeploying immediately fixed (2) and (3)

Let me know if you have questions or trouble reproducing any of this. I need to get it done, so I'll just setup the custom domains by hand for now.

PS -- Sorry if I'm overloading this single issue, happy to split off into a different thread if appropriate.

@Miserlou
Copy link
Owner Author

Miserlou commented Mar 3, 2017

So I just tried my proposed solution up there ^, expecting a between 0 and 40 minute downtime, and it seems like the change is in fact instantaneous. So, it's not as elegant as actually rotating the certificate, but the damage isn't as bad as I imagined. <1 minute * 4 times a year - that's still quite a lot of 9's of uptime.

@Miserlou
Copy link
Owner Author

Miserlou commented Mar 3, 2017

Update: This does NOT work as intended. Gosh dang it.

It works, I was able to see the page at the domain, immediately after calling certify, THEN the distribution goes down, once the new distribution has been created.

So, as it stands, we can either have it work immediately, but then it breaks 40 minutes later, OR we can have it break immediately but it works 40 minutes later. Very annoying!

Obviously the latter is preferred, but is there an opportunity for hacks here? Can we do some kind of domain name dance to ensure uptime?

@pjz
Copy link
Contributor

pjz commented Mar 5, 2017

What's taking the 40 minutes? API Gateway? or route53? Is it possible to leave the old one in place but make a new one and then make the changeover but not nuke the old one for another 40 minutes? Basically: keep them both in existence until the changeover time is complete, so there's never downtime - the site is always served with either the old one or the new one.

@Miserlou
Copy link
Owner Author

Miserlou commented Mar 5, 2017

That'd be ideal, but would require either an extremely long running process, or we set up a scheduled function to do it from AWS, which might not work for everybody's setup.

In the meantime, I've just added a --manual mode mode so you can rotate the cert without any downtime.

@pjz
Copy link
Contributor

pjz commented Mar 5, 2017

This is ideally all happening via the recertify job, right? So it's already a long running process?

@kwellman
Copy link
Contributor

kwellman commented Mar 6, 2017

Unless I'm missing something, it seems like something that could be done with no downtime. In the certify handler set up a new API gateway, leave the old one, point the route 53 domain to the new API gateway. The old API gateway could be deleted the next time the certify handler is called.

@chromakey
Copy link

Will this work since APIGW doesn't allow you to have two of the same custom domain names in the system at any given time?

@kwellman
Copy link
Contributor

kwellman commented Mar 6, 2017

That's a very good point. Also, now that i think about it we would also need to ensure that the API gateway is fully propagated before pointing the route53 domain to it to guarantee no downtime.

@chromakey
Copy link

To give a sense of my manual process to work around this for the moment, here's what I've had to do.

  1. Delete the domain via the console from APIGW.
  2. Wait about 20 minutes (if you try to do something before the CloudFront edge nodes are updated, it will throw an error saying the domain already exists)
  3. Re-certify my project using zappa certify production (or equivalent).
  4. Wait another 20 minutes.

CloudFront warns you to expect that domain deletion/creation can take up to 40 minutes each way, but it's been my experience that it's usually in the 20-25 minute range. I have about 15 sites using Zappa at the moment (thankfully none that are SLA sensitive) and this is the method I've used successfully since we discovered the issue.

I'm not sure how we can clear this hurdle without:

  1. AWS exposing certificate rotation for APIGW to the API.
  2. AWS allowing ACM to run with APIGW.
  3. AWS allowing multiple instances of the same custom domain name (to use as a placeholder).
  4. Something else?

@Miserlou
Copy link
Owner Author

Miserlou commented Mar 6, 2017

@chromakey - if your current strategy already involves manual deletion, you can use the new --manual feature to use the console to update the cert to avoid downtime. The updated certify behavior will do a (reordered) version of your steps automatically.

@chromakey
Copy link

Sweet. I'll take it for a test drive in the next couple days.

@Miserlou Miserlou closed this as completed Mar 8, 2017
@chromakey
Copy link

Haven't looked into this yet, but looks like we may have a path to success

https://aws.amazon.com/about-aws/whats-new/2017/03/amazon-api-gateway-integrates-with-aws-certificate-manager-acm/

@Miserlou
Copy link
Owner Author

Yep! Currently open for discussion here: #710

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants