Auto-deploy to production #106

Closed
bickelj opened this issue Mar 25, 2024 · 11 comments

bickelj commented Mar 25, 2024

Last week I suggested it would not be a heavy lift to auto-deploy to production. The team requested it. In order to deploy to production more safely, I added PR #102 so that the deployment script can verify that deployment to test worked.

bickelj self-assigned this Mar 25, 2024
bickelj added a commit that referenced this issue Mar 25, 2024
Without this change, someone needs to verify that the automated
deployment to the test environment succeeded and then manually type
two or three commands to deploy to production. At the moment, there is
only one person who has ever done such production deployments.

With this change, however, the deployment to the test environment gets
verified by visiting a URL that exposes the current deployed version.
When the body of the response from that URL matches the version (tag)
sent to this deployment job, this action can safely conclude that the
test deployment has succeeded. Why can it draw such a confident
conclusion? Several reasons:

* The `deploy.sh` script saves the new version only on success.
* The server in the `reverse-proxy` container serves the version that
  the `deploy.sh` script wrote.
* The `reverse-proxy` container does not start until the `database`
  and `web` containers are running according to docker health checks.

In other words, accessing this version string implies that the state
of the environment is OK. If the version string matches in the test
environment, this gives some confidence that auto-deploying to the
production environment is relatively safe. Therefore it auto-deploys
to production. This means that merging pull requests in the `service`
repository is sufficient to trigger automatic deployments to production,
and the production deployment is safe because the same steps have
already run for the deployment to test. No more asking someone to
deploy to production!

Issue #106 Auto-deploy to production
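For readers following along, here is a minimal sketch of the flow that commit message describes. The job, step, and script wiring below are assumptions; only the version URL (quoted later in this thread), `deploy.sh`, and the general "verify test, then deploy to production" shape come from the message itself.

```yaml
deploy-production:
  needs: deploy-test        # assumed job name for the test deployment
  runs-on: ubuntu-latest
  steps:
    - name: Verify that the test environment serves the expected version
      run: |
        # The reverse-proxy only serves this string after deploy.sh succeeded and
        # the database and web containers passed their Docker health checks.
        deployed="$(curl -fsS https://api-test.philanthropydatacommons.org/software-version)"
        test "${deployed}" = "${{ github.ref_name }}"
    - name: Deploy to production
      run: ./trigger_deployment.sh   # assumed entry point; see the comments below
```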
bickelj added a commit that referenced this issue Mar 25, 2024
Since "send tag to machine" is indecipherable, the action is renamed.

Issue #106 Auto-deploy to production
bickelj added a commit that referenced this issue Mar 27, 2024
Issue #106 Auto-deploy to production
bickelj added a commit that referenced this issue Mar 27, 2024
Issue #106 Auto-deploy to production
bickelj added a commit that referenced this issue Mar 28, 2024
Issue #106 Auto-deploy to production

bickelj commented Mar 28, 2024

I just tested the guard against production deployment (by pushing a tag on a branch) and that part seemed to work; see https://github.com/PhilanthropyDataCommons/deploy/actions/runs/8468483885/job/23201622661:
(screenshot: deploy_guard)

However, the test deployment did not work.
Ah, it exited 4. What is that?
`test ! -z "${KNOWN_HOSTS}" || exit 4`
I suppose `KNOWN_HOSTS` needs to be passed in.

And I don't think the SSH stuff needs to be passed to the action that checks for a tag to be in main.
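In workflow terms, the fix presumably looks something like the step below. The step name and the `secrets.KNOWN_HOSTS` wiring are assumptions; only `KNOWN_HOSTS` and `trigger_deployment.sh` come from the run above.

```yaml
- name: Trigger deployment to test
  run: ./trigger_deployment.sh
  env:
    # Satisfies the guard `test ! -z "${KNOWN_HOSTS}" || exit 4` in the script.
    KNOWN_HOSTS: ${{ secrets.KNOWN_HOSTS }}
```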

bickelj added a commit that referenced this issue Mar 28, 2024
Before this commit, the `KNOWN_HOSTS` variable was not sent to
`trigger_deployment.sh`, causing failures. It also seems inappropriate
to pass secrets to 3rd-party actions if avoidable, so this commit also
removes access to the secrets for the action that checks whether the
given tag is on the main branch.

Issue #106 Auto-deploy to production

bickelj commented Mar 28, 2024

Oh, right, passing the same tag on the same (old) branch means the old workflow ran here: https://github.com/PhilanthropyDataCommons/deploy/actions/runs/8469064608/workflow. I need to try a new tag on a rebased branch.

bickelj added a commit that referenced this issue Mar 28, 2024
Issue #106 Auto-deploy to production

bickelj commented Mar 28, 2024

The `trigger_deployment.sh` script ran successfully, but the action that polls a URL does not seem to be working as expected.

Locally,

$ curl https://api-test.philanthropydatacommons.org/software-version
20240328-cb1e2ff-throwaway

And the action partially works as expected in that it polls every 20 seconds; nginx logs on the test host:

nginx 14:50:44.90 INFO  ==> ** Starting NGINX **
me - - [28/Mar/2024:14:51:00 +0000] "GET /software-version HTTP/1.1" 200  27 "-" "curl/7.x.x" "-"
github - - [28/Mar/2024:14:51:01 +0000] "GET /software-version HTTP/1.1" 200  27 "-" "-" "-"
github - - [28/Mar/2024:14:51:21 +0000] "GET /software-version HTTP/1.1" 200  27 "-" "-" "-"
github - - [28/Mar/2024:14:51:41 +0000] "GET /software-version HTTP/1.1" 200  27 "-" "-" "-"
github - - [28/Mar/2024:14:52:01 +0000] "GET /software-version HTTP/1.1" 200  27 "-" "-" "-"
github - - [28/Mar/2024:14:52:21 +0000] "GET /software-version HTTP/1.1" 200  27 "-" "-" "-"

And the timeout mechanism worked: it gave up after 10 minutes.

However, it should have succeeded. I wonder if I passed the wrong variable name. There was some inconsistency in the documentation about snake_case versus camelCase.

Uh, this is odd:

Error: Expected body: 20240328-cb1e2ff-throwaway, actual body: 20240328-cb1e2ff-throwaway

Those look identical to me, but maybe there's a whitespace difference.

Yes: the empty line 17 in the log output shows that the actual body ends with a newline, while the expected body does not.
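One way to make the difference visible is to dump the exact bytes of the body; run locally (or as a throwaway step like the sketch below), `od -c` shows the trailing `\n` explicitly:

```yaml
- name: Dump the exact bytes of the served version string (diagnostic sketch)
  run: curl -s https://api-test.philanthropydatacommons.org/software-version | od -c
```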


bickelj commented Mar 28, 2024

Two options:

  1. Expect the newline
  2. Remove the newline

Because this version ends up in text files on a GNU system, where a trailing newline is conventional, I think we should keep it and expect it (option 1).


bickelj commented Mar 28, 2024

Surprise: it is not clear how best to represent a line terminator in YAML.

yaml.info says you can use escape characters in a double-quoted string.
yaml.org says YAML 1.1 treats line break characters one way while 1.2 treats them another.

I'll try the straightforward addition of \n to the quoted string and see what happens.
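Concretely, the attempt is a fragment of the polling step along these lines (the input name here is an assumption at this point; in a YAML 1.2 double-quoted scalar, `\n` is an escape for a line feed):

```yaml
with:
  # Double-quoted scalar: "\n" becomes an actual line feed when the YAML is parsed.
  expectedBody: "20240328-cb1e2ff-throwaway\n"
```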

bickelj added a commit that referenced this issue Mar 28, 2024
Without this change, the CI action that asks which version is running
in a target environment would fail because of a line terminator
difference in the expected string versus the returned string.

This change attempts to fix the issue by adding an escaped newline to
the expected response body.

Issue #106 Auto-deploy to production

bickelj commented Mar 28, 2024

Oh no, the function that reads the `expectedBody` string trims whitespace.

Next try: `expectBodyRegex`.
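So the step's input switches to a pattern that tolerates trailing whitespace; a sketch (only the `expectBodyRegex` name comes from the comment above):

```yaml
with:
  # Single-quoted so the backslash reaches the action untouched; \s* tolerates the newline.
  expectBodyRegex: '20240328-cb1e2ff-throwaway\s*'
```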

bickelj added a commit that referenced this issue Mar 28, 2024
Without this change, the CI action that asks which version is running
in a target environment would fail because of a line terminator
difference in the expected string versus the returned string.

This change attempts to fix the issue by using a regular expression
that allows trailing whitespace. An earlier attempt to add an escaped
newline to the expected response body failed because inputs are trimmed
in the action (at the time of this commit).

Issue #106 Auto-deploy to production

bickelj commented Mar 28, 2024

The outstanding issues should be resolved. It is time to try the full pipeline beginning with a service repo merge.

Edit: OK, I was hoping to find an easy dependabot PR to merge in `service`, but no dice. I'll do the intermediate thing: push a tag here in `deploy`. I added 20240328-457c494 and pushed it. Results here.

Hmm, the main branch check is not working as expected either.

It might be the way it's doing the git checkout. I see `fetch-tags: false` in the checkout action.

That's probably the issue. I reproduced it locally like this, following (most of) the commands run by the checkout action:

git init throwaway_deploy
cd throwaway_deploy
git remote add origin https://github.com/PhilanthropyDataCommons/deploy
git config --local gc.auto 0
git -c protocol.version=2 fetch --no-tags --prune --no-recurse-submodules --depth=20 origin +457c494dc3c01eb726cf739be4a149666e11cc34:refs/tags/20240328-457c494
git log --graph --decorate --all
...

And I do not see main in there. If I drop `--no-tags` and use `--tags` instead, I get an error. Regardless, there should be a way to include tags in the checkout performed by the action.
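The likely fix is in the checkout step itself; `actions/checkout` has inputs for both tags and history depth (the exact values below are a sketch):

```yaml
- uses: actions/checkout@v4
  with:
    fetch-depth: 0    # full history, so main is reachable for the branch-membership check
    fetch-tags: true  # fetch tags in addition to the requested ref
```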

bickelj added a commit that referenced this issue Mar 28, 2024
In order to check whether the checked-out tag is part of the main
branch, the tags and branch refs need to be present in the checkout.

Issue #106 Auto-deploy to production
bickelj added a commit that referenced this issue Mar 28, 2024
Issue #106 Auto-deploy to production

bickelj commented Mar 28, 2024

Ah, here is how it's supposed to look when the check-if-this-tag-is-in-that-branch action successfully determines that the tag does not exist in that branch:
(screenshot: test_tag_in_branch)

The failures earlier were actual errors when running the action. This looks much better:

[action-contains-tag] Branch 'remotes/origin/main' does not contain tag '20240328-3e030bd-throwaway'.
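For reference, the check is roughly equivalent to asking git whether the tagged commit is reachable from main; a sketch of the same test as a plain run step:

```yaml
# Exit 0 when the tagged commit is an ancestor of origin/main, non-zero otherwise.
- name: Require the tag to be on main
  run: git merge-base --is-ancestor "${{ github.ref_name }}" origin/main
```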


bickelj commented Mar 28, 2024

I pushed tag 20240328-b785d59, which should cause a test deployment and then a production deployment in sequence, assuming the test deployment works. If that all works, this issue should be resolved. It might still be nice to see it happen when triggered from the `service` repo, though.

Good:

[action-contains-tag] Branch 'remotes/origin/main' contains tag '20240328-b785d59'.


bickelj commented Mar 28, 2024

Apparently I enabled a protection rule at https://github.com/PhilanthropyDataCommons/deploy/settings/environments/556150864/edit a long time ago:

Deployment protection rules

Configure reviewers, timers, and custom rules that must pass before deployments to this environment can proceed.
Required reviewers

Specify people or teams that may approve workflow runs when they access this environment.

I'm removing that review requirement.


bickelj commented Mar 28, 2024

It should work fine. If not, re-open.
