remove per-instance crontab entries #2200

Merged: 5 commits merged into master from crontab20243007 on Aug 13, 2024
Conversation

chris48s
Member

Right. So I did some digging.
All of these commands actually are running on the EE prod server(s).
EE has been restoring itself from a copy of its own backup every night 🙀

I don't think it was the intent that these jobs should have been running on WDIV servers running their own EE instance. We definitely wouldn't have wanted them running snoop, for example, and sync_elections is there as an alternative to restoring from backup every night.

I'm also a bit worried that we have no log aggregation and not really any visibility into these cron jobs, but I'm not going to do anything about those problems here.

This is one of those PRs that asks a lot of questions as well as answering some, so I'm going to leave some comments inline.

Comment on lines -2 to +3
# HOWEVER: these commands will NOT run on instances booted from the EE AMI
# HOWEVER: these commands will NOT run on instances
# booted from the EE AMI (i.e: WDIV installs running a "local" EE)
Member Author

I've attempted to clarify things with this comment. Does this seem better?

# Restore staging from prod backup
self.add_job(
"restore_staging_from_s3",
"cron(0 2 * * ? *)",
Member Author

If this was in a crontab, I would write

0 2 * * *

The syntax used elsewhere for these jobs seems to imply this should be

0 2 * * ? *

Is there any documentation I can read to understand this?

Member

Yes

I'm sure there's some logic to this, but AWS deciding to use a slightly different scheduling syntax from UNIX cron is baffling, and has led to problems in the past (e.g. you can't validate using crontab.guru).
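
For reference, here is "2am every day" in both syntaxes. The AWS form has six fields (it adds a year field), and exactly one of the day-of-month / day-of-week positions must be a ?:

# standard UNIX cron: minute hour day-of-month month day-of-week
0 2 * * *
# AWS EventBridge/CloudWatch cron: minute hour day-of-month month day-of-week year
cron(0 2 * * ? *)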

@@ -74,7 +74,7 @@ def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
self.add_job(
"add_nuts1_tags",
"cron(15 2 * * ? *)",
"""output-on-error ee-manage-py-command add_tags -u "https://s3.eu-west-2.amazonaws.com/ee.public.data/ons-cache/NUTS_Level_1_(January_2018)_Boundaries.gpkg" --fields '{"NUTS118NM": "value", "NUTS118CD": "key"}' --tag-name NUTS1 --is-gpkg""",
"output-on-error /var/www/every_election/repo/serverscripts/add_nuts1.sh",
Member Author

This is basically running the same command that is in this shell script.
Might as well use the shell script?

# If you want to have a command run only once per env / once globally, use SSM Run Command
0 2 * * * every_election cd /var/www/every_election/repo && ./serverscripts/rebuild_local_db.sh
30 2 * * * every_election cd /var/www/every_election/repo && ./serverscripts/add_nuts1.sh
0 1 * * * every_election ee-manage-py-command dumpdata elections election_snooper --indent 4 -o /var/www/every_election/repo/every_election/data/elections.json
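
As mentioned in the comment at the top of this crontab, a one-off run via SSM Run Command would look roughly like the sketch below (the targeting tag key/value is an assumption, not something defined in this repo):

# hypothetical sketch: run a single command once on instances matching a tag
aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --targets "Key=tag:dc-environment,Values=production" \
  --parameters 'commands=["output-on-error /var/www/every_election/repo/serverscripts/add_nuts1.sh"]'
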
Member Author

This dumpdata command just doesn't need to run any more. Since #1991 it is not relevant. I will remove some other things relating to this, but that is another PR. Here I am just removing this command.

Member Author

PR open at #2201

Member

This isn't quite right: the sync_elections management command will only sync...elections. That is, if we change a division set, add organisations, or do more or less anything else with the data model, it's not going to update.

We still restore from a DB backup nightly on the embedded instances. sync_elections is really for quick syncing of new elections and e.g. cancellations.

Also, yes, there is a problem if the data model in EE prod changes before the onboard EEs have had the code updated. That's such a rare case though, and one that doesn't happen just before elections, so based on our current traffic assumptions it's a risk we're OK with.

Member Author

OK. Thanks for the clarification.
Given none of the commands in here run on embedded instances, the upshot for this PR is no change though, right?

Member

Sorry, yes, that's correct.

@@ -91,6 +91,14 @@ def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
f"output-on-error ee-manage-py-command export_ballots_as_wkt_csv --bucket 'ee.data-cache.{dc_environment}' --prefix 'ballots-with-wkt'",
)

if dc_environment == "staging":
Member Author

This looks right, but I've not actually tested any changes in this file. Is there any meaningful testing we can do on this other than deploy + monitor it?

Member

The string value is enforced here. I think it's reasonable to trust that the string value is correct, but I guess we could pull valid_environments out and share the values somehow.
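
A rough sketch of one shape for that (the module and names here are assumptions, not the repo's actual code): define the valid environments once, then import them in both the CDK stack and wherever the string is enforced.

from enum import Enum

# single source of truth for the environment names
class DCEnvironment(str, Enum):
    DEVELOPMENT = "development"
    STAGING = "staging"
    PRODUCTION = "production"

# because of the str mixin, comparisons against plain strings still work:
# if dc_environment == DCEnvironment.STAGING: ...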

Member Author

I was thinking about the changes to this file slightly more widely, tbh. Just had to put the comment somewhere.

Member

Right. Yeah, not much else we can do apart from deploy and monitor, I don't think. I also don't think we get notified of failed run_command jobs... when you looked the other day, for example, I saw some that had failed.

@chris48s
Member Author

The other big worry I have here is

- location: serverscripts/rebuild_local_db.sh
timeout: 900
runas: every_election

When are these commands run? Are we running this on every deploy, including to production??
If it helps, the original commit doing all this was e48dd91

@chris48s chris48s requested a review from symroe July 30, 2024 14:41
@symroe
Member

symroe commented Jul 31, 2024

The other big worry I have here is

- location: serverscripts/rebuild_local_db.sh
timeout: 900
runas: every_election

When are these commands run? Are we running this on every deploy, including to production?? If it helps, the original commit doing all this was e48dd91

Thinking about this more, I expect this was all caused by us copying the WDIV CDK model. Specifically, in WDIV we use replication, and here we don't.

When using replication, you want to do something with the local DB before the instance serves traffic. When using a single remote postgres, you don't. I think I must have been confused between the two models.

@chris48s
Member Author

I must have been confused between the two models.

OK... so the upshot for this PR is: remove that line from appspec.yml then?

@coveralls

coveralls commented Jul 31, 2024

Coverage Status

coverage: 70.675% (+0.001%) from 70.674% when pulling 981ddb6 on crontab20243007 into d49959b on master.

@chris48s
Member Author

chris48s commented Jul 31, 2024

Right.

Just for future reference: this was kind of less bad than we first thought.
(For reasons) as well as having a shared RDS instance, the EE VMs also have a local Postgres install. The rebuild_local_db script was running against the local Postgres install, not the shared RDS.

So I removed that line from appspec.yml. My next thought was that this also makes

if dc_environment == "staging":
# Restore staging from prod backup
self.add_job(
"restore_staging_from_s3",
"cron(0 2 * * ? *)",
"output-on-error /var/www/every_election/repo/serverscripts/rebuild_local_db.sh",

wrong because we really want staging to run that against the staging RDS, not the local DB.

Then I realised there is no staging RDS (or at least no staging DB connection credentials). Probably no dev RDS either.
So maybe this is actually sort of right.
I think the next question is: do we think it is right that dev/staging are using a local DB rather than RDS?

@chris48s
Member Author

I reckon removing the command from appspec.yml is why the deploy has completely hung as well.

It's probably getting a 500 from the healthcheck endpoint or whatever, because the (local) DB is empty.

@chris48s
Member Author

chris48s commented Aug 8, 2024

We agreed that what we would do is:

  • Set EE staging to use RDS instead of local postgres
  • Set EE dev to use RDS instead of local postgres
  • Do a one-off import of data into those two DBs (and leave daily syncing of data from prod for another time)
  • Remove use of rebuild_local_db.sh in this PR completely

So here's a quick summary of what I've done, just because a bunch of stuff has happened outside of code/GitHub:

I created 2 DBs on the shared development RDS called ee_dev and ee_staging.

Then I restored a dump from prod into each of those 2 DBs.

Then I made 2 users called ee_dev and ee_staging.

Then I ran:

ee_dev=> GRANT CONNECT ON DATABASE ee_dev TO ee_dev;
ee_dev=> GRANT USAGE ON SCHEMA public TO ee_dev;
ee_dev=> GRANT CREATE ON SCHEMA public TO ee_dev;
ee_dev=> GRANT ALL PRIVILEGES ON DATABASE ee_dev TO ee_dev;
ee_dev=> GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO ee_dev;
ee_dev=> ALTER DATABASE ee_dev OWNER TO ee_dev;

..and the equivalent for the ee_staging user/DB.

(may not have actually happened in that exact order, but that's the most coherent way to present it 😄 )

One of the reasons I mention this is that I suspect it is not identical to the way the other DBs/users on this host are set up.

Then in AWS Parameter Store I set the

EE_DATABASE_HOST
EE_DATABASE_NAME
EE_DATABASE_USER
EE_DATABASE_PASSWORD

environment variables for each of the two environments (dev and staging). Dev is confirmed working, deployed with those settings. We haven't yet deployed to staging; it should deploy successfully (now that I've got the permissions right on dev) but is as yet untested. Separately, EE staging cannot currently be accessed due to a CloudFront configuration issue, but that is another issue for another time.
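
Setting one of these via the CLI looks roughly like the sketch below (the parameter name/path is an assumption; the repo may namespace these per environment differently, and the value shown is a placeholder):

# hypothetical sketch: store a DB connection setting in Parameter Store
aws ssm put-parameter \
  --name "EE_DATABASE_HOST" \
  --value "<dev-rds-endpoint>.eu-west-2.rds.amazonaws.com" \
  --type SecureString \
  --overwrite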

@chris48s
Member Author

I'm going to deploy this one first to make sure the staging RDS setup is fine. Then I will do #2198 once I've checked that.

@chris48s chris48s merged commit 4e0452b into master Aug 13, 2024
2 checks passed