remove per-instance crontab entries #2200
Conversation
-# HOWEVER: these commands will NOT run on instances booted from the EE AMI
+# HOWEVER: these commands will NOT run on instances
+# booted from the EE AMI (i.e: WDIV installs running a "local" EE)
I've attempted to clarify things with this comment. Does this seem better?
cdk_stacks/stacks/command_runner.py (outdated)
# Restore staging from prod backup
self.add_job(
    "restore_staging_from_s3",
    "cron(0 2 * * ? *)",
If this was in a crontab, I would write
0 2 * * *
The syntax used elsewhere for these jobs seems to imply this should be
0 2 * * ? *
Is there any documentation I can read to understand this?
I'm sure there's some logic to this, but AWS deciding to use a slightly different scheduling syntax than UNIX cron is baffling, and has led to problems in the past (e.g. you can't validate using crontab.guru).
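For reference, a quick side-by-side of the two notations (assuming the standard six-field EventBridge format; worth double-checking against AWS's scheduled-events docs):

    # standard crontab: minute hour day-of-month month day-of-week
    0 2 * * *

    # AWS cron(): minute hour day-of-month month day-of-week year,
    # where exactly one of day-of-month / day-of-week must be "?"
    cron(0 2 * * ? *)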
@@ -74,7 +74,7 @@ def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
     self.add_job(
         "add_nuts1_tags",
         "cron(15 2 * * ? *)",
-        """output-on-error ee-manage-py-command add_tags -u "https://s3.eu-west-2.amazonaws.com/ee.public.data/ons-cache/NUTS_Level_1_(January_2018)_Boundaries.gpkg" --fields '{"NUTS118NM": "value", "NUTS118CD": "key"}' --tag-name NUTS1 --is-gpkg""",
+        "output-on-error /var/www/every_election/repo/serverscripts/add_nuts1.sh",
This is basically running the same command that is in this shell script.
Might as well use the shell script?
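For anyone following along, a minimal sketch of what an add_job helper like this might look like, assuming it wraps an EventBridge rule; the real command_runner.py implementation may differ (e.g. it may target SSM Run Command):

    from aws_cdk import aws_events as events
    from constructs import Construct


    class CommandRunnerSketch(Construct):
        # Hypothetical sketch, not the actual command_runner.py code.
        def add_job(self, job_id: str, schedule: str, command: str) -> None:
            # `schedule` uses AWS's six-field form, e.g. "cron(15 2 * * ? *)"
            events.Rule(
                self,
                job_id,
                schedule=events.Schedule.expression(schedule),
                # a real target would then run `command` on the instance(s),
                # e.g. via SSM Run Command or a small Lambda
            )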
# If you want to have a command run only once per env / once globally, use SSM Run Command
0 2 * * * every_election cd /var/www/every_election/repo && ./serverscripts/rebuild_local_db.sh
30 2 * * * every_election cd /var/www/every_election/repo && ./serverscripts/add_nuts1.sh
0 1 * * * every_election ee-manage-py-command dumpdata elections election_snooper --indent 4 -o /var/www/every_election/repo/every_election/data/elections.json
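Side note on the format: these are system-style crontab entries (/etc/crontab or /etc/cron.d/*), which take a user field between the schedule and the command, unlike a per-user crontab:

    # system crontab: schedule  user  command
    0 2 * * *  every_election  ./serverscripts/rebuild_local_db.sh

    # per-user crontab (crontab -e): schedule  command
    0 2 * * *  ./serverscripts/rebuild_local_db.sh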
This dumpdata command just doesn't need to run any more; since #1991 it is not relevant. I will remove some other things relating to this, but that is another PR. Here I am just removing this command.
PR open at #2201
This isn't quite right: the sync_elections management command will only sync... elections. That is, if we change a division set, add organisations, or do more or less anything else with the data model, it's not going to update.
We still restore from a DB backup nightly on the embedded instances. sync_elections is really for quick syncing of new elections and, e.g., cancellations.
Also, yes, there is a problem if the data model in EE prod changes before the onboard EEs have had the code updated. That's such a rare case though, and one that doesn't happen just before elections, so based on our current traffic assumptions it's a risk we're OK with.
OK. Thanks for the clarification.
Given none of the commands in here run on embedded instances, the upshot for this PR is no change though, right?
Sorry, yes, that's correct.
cdk_stacks/stacks/command_runner.py (outdated)
@@ -91,6 +91,14 @@ def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
     f"output-on-error ee-manage-py-command export_ballots_as_wkt_csv --bucket 'ee.data-cache.{dc_environment}' --prefix 'ballots-with-wkt'",
 )

+if dc_environment == "staging":
This looks right, but I've not actually tested any changes in this file. Is there any meaningful testing we can do on this other than deploy + monitor it?
The string value is enforced here. I think it's reasonable to trust that the string value is correct, but I guess we could pull valid_environments out and share the values somehow.
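A minimal sketch of that sharing idea, assuming a small shared module (the module name and values here are hypothetical):

    # hypothetical shared module, e.g. cdk_stacks/environments.py
    VALID_ENVIRONMENTS = ("development", "staging", "production")

    # command_runner.py could then validate against the shared list:
    from environments import VALID_ENVIRONMENTS

    if dc_environment not in VALID_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {dc_environment!r}")

    if dc_environment == "staging":
        ...  # staging-only jobs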
I was thinking about the changes to this file slightly more widely, tbh. Just had to put the comment somewhere.
Right. Yeah, not much else we can do apart from deploy and monitor, I don't think. I also don't think we get notified of failed run_command jobs... I saw some that had failed when you looked the other day, for example.
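Assuming these jobs do go through SSM Run Command, one way to eyeball failures by hand (a sketch, not wired into any alerting):

    aws ssm list-command-invocations \
        --filters key=Status,value=Failed \
        --query 'CommandInvocations[].[CommandId,DocumentName,RequestedDateTime]' \
        --output table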
Thinking about this more, I expect this was all caused by us copying the WDIV CDK model. Specifically, in WDIV we use replication, and here we don't. When using replication, you want to do something with the local DB before the instance serves traffic. When using a single remote Postgres, you don't. I think I must have been confused between the two models.
OK... so the upshot for this PR is: remove that line from cdk_stacks/stacks/command_runner.py?
Right. Just for future reference: this was kind of less bad than we first thought. So I removed that line from EveryElection/cdk_stacks/stacks/command_runner.py (lines 94 to 99 in 38ad35a), but that was wrong, because we really want staging to run that against the staging RDS, not the local DB. Then I realised there is no staging RDS (or at least no staging DB connection credentials). Probably no dev RDS either.
I reckon removing the command from the crontab will sort it. It's probably getting a 500 from the healthcheck endpoint or whatever because the (local) DB is empty.
We agreed that what we would do is:
So here's a quick summary of what I've done, just because a bunch of stuff has happened outside of code/GitHub: I created 2 DBs on the shared development RDS, one for dev and one for staging. Then I restored a dump from prod into each of those 2 DBs. Then I made 2 users, one per DB, and ran the setup commands for one DB/user pair and the equivalent for the other (may not have actually happened in that exact order, but that's the most coherent way to present it 😄). One of the reasons I mention this is I suspect this is not identical to the way the other DBs/users on this host are set up.

Then in AWS Parameter Store I set the environment variables for each of the two environments (dev and staging). Dev is confirmed working, deployed with those settings. We haven't yet deployed to staging; it should deploy successfully (now that I've got the permissions right on dev), but it's as yet untested.

As another issue, EE staging can not be accessed due to a CloudFront configuration issue, but that is another issue for another time.
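A hedged reconstruction of roughly what that setup involves; every name below (host, DB names, users, parameter path) is a hypothetical stand-in, since the real identifiers didn't survive into this comment:

    # hypothetical names throughout
    createdb -h shared-dev-rds.example.com -U postgres every_election_dev
    pg_restore -h shared-dev-rds.example.com -U postgres \
        -d every_election_dev prod_dump.pgdump  # assumes a custom-format dump

    psql -h shared-dev-rds.example.com -U postgres -c \
        "CREATE USER ee_dev WITH PASSWORD '...';
         GRANT ALL PRIVILEGES ON DATABASE every_election_dev TO ee_dev;"

    # ...and the equivalent for the staging DB/user. Then the connection
    # settings go into Parameter Store, e.g.:
    aws ssm put-parameter --name /every_election/development/DATABASE_URL \
        --type SecureString \
        --value 'postgres://ee_dev:...@shared-dev-rds.example.com/every_election_dev'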
I'm going to deploy this one first to make sure the staging RDS setup is fine. Then I will do #2198 second once I've checked that.
(Branch force-pushed from a0dae60 to b8c57f5.)
Right. So I did some digging.
All of these commands actually are running on the EE prod server(s).
EE has been restoring itself from a copy of its own backup every night 🙀
I don't think it was the intent that these jobs should have been running on WDIV servers running their own EE instance. We definitely wouldn't have wanted them running snoop, for example, and sync_elections is there as an alternative to restoring from backup every night.

Also a bit worried that we have no log aggregation and not really any visibility on these cron jobs, but I'm not going to do anything about those problems here.
This is one of those PRs that asks a lot of questions as well as answering some, so I'm going to leave some comments inline.