remove per-instance crontab entries #2200

Merged: 5 commits merged into master from crontab20243007 on Aug 13, 2024
Conversation

chris48s
Member

Right. So I did some digging.
All of these commands actually are running on the EE prod server(s).
EE has been restoring itself from a copy of its own backup every night 🙀

I don't think it was the intent that these jobs should have been running on WDIV servers running their own EE instance. We definitely wouldn't have wanted them running snoop, for example, and sync_elections is there as an alternative to restoring from backup every night.

I'm also a bit worried that we have no log aggregation and not really any visibility into these cron jobs, but I'm not going to do anything about those problems here.

This is one of those PRs that asks a lot of questions as well as answering some, so I'm going to leave some comments inline.

Comment on lines -2 to +3
# HOWEVER: these commands will NOT run on instances booted from the EE AMI
# HOWEVER: these commands will NOT run on instances
# booted from the EE AMI (i.e: WDIV installs running a "local" EE)
Member Author

I've attempted to clarify things with this comment. Does this seem better?

# Restore staging from prod backup
self.add_job(
"restore_staging_from_s3",
"cron(0 2 * * ? *)",
Member Author

If this was in a crontab, I would write

0 2 * * *

The syntax used elsewhere for these jobs seems to imply this should be

0 2 * * ? *

Is there any documentation I can read to understand this?

Member

Yes

I'm sure there's some logic to this, but AWS deciding to use a slightly different scheduling syntax from UNIX cron is baffling, and has led to problems in the past (e.g. you can't validate using crontab.guru).
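
For reference, here is "2am every day" in both syntaxes. The AWS form has six fields (it adds a year field), and exactly one of the day-of-month / day-of-week positions must be a ?:

# standard UNIX cron: minute hour day-of-month month day-of-week
0 2 * * *
# AWS EventBridge/CloudWatch cron: minute hour day-of-month month day-of-week year
cron(0 2 * * ? *)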

@@ -74,7 +74,7 @@ def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
self.add_job(
"add_nuts1_tags",
"cron(15 2 * * ? *)",
"""output-on-error ee-manage-py-command add_tags -u "https://s3.eu-west-2.amazonaws.com/ee.public.data/ons-cache/NUTS_Level_1_(January_2018)_Boundaries.gpkg" --fields '{"NUTS118NM": "value", "NUTS118CD": "key"}' --tag-name NUTS1 --is-gpkg""",
"output-on-error /var/www/every_election/repo/serverscripts/add_nuts1.sh",
Member Author

This is basically running the same command that is in this shell script.
Might as well use the shell script?

# If you want to have a command run only once per env / once globally, use SSM Run Command
0 2 * * * every_election cd /var/www/every_election/repo && ./serverscripts/rebuild_local_db.sh
30 2 * * * every_election cd /var/www/every_election/repo && ./serverscripts/add_nuts1.sh
0 1 * * * every_election ee-manage-py-command dumpdata elections election_snooper --indent 4 -o /var/www/every_election/repo/every_election/data/elections.json
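
As mentioned in the comment at the top of this crontab, a one-off run via SSM Run Command would look roughly like the sketch below (the targeting tag key/value is an assumption, not something defined in this repo):

# hypothetical sketch: run a single command once on instances matching a tag
aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --targets "Key=tag:dc-environment,Values=production" \
  --parameters 'commands=["output-on-error /var/www/every_election/repo/serverscripts/add_nuts1.sh"]'
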
Member Author

This dumpdata command just doesn't need to run any more. Since #1991 it is not relevant. I will remove some other things relating to this, but that is another PR. Here I am just removing this command.

Member Author

PR open at #2201

Member

This isn't quite right: the sync_elections management command will only sync...elections. That is, if we change a division set, add organisations, or do more or less anything else with the data model, it's not going to update.

We still restore from a DB backup nightly on the embedded instances. sync_elections is really for quick syncing of new elections and e.g. cancellations.

Also, yes, there is a problem if the data model in EE prod changes before the onboard EEs have had the code updated. That's such a rare case though, and one that doesn't happen just before elections, so based on our current traffic assumptions it's a risk we're OK with.

Member Author

OK. Thanks for the clarification.
Given none of the commands in here run on embedded instances, the upshot for this PR is no change though, right?

Member

Sorry, yes, that's correct.

@@ -91,6 +91,14 @@ def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
f"output-on-error ee-manage-py-command export_ballots_as_wkt_csv --bucket 'ee.data-cache.{dc_environment}' --prefix 'ballots-with-wkt'",
)

if dc_environment == "staging":
Member Author

This looks right, but I've not actually tested any changes in this file. Is there any meaningful testing we can do on this other than deploy + monitor it?

Member

The string value is enforced here. I think it's reasonable to trust that the string value is correct, but I guess we could pull valid_environments out and share the values somehow.
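
A rough sketch of one shape for that (the module and names here are assumptions, not the repo's actual code): define the valid environments once, then import them in both the CDK stack and wherever the string is enforced.

from enum import Enum

# single source of truth for the environment names
class DCEnvironment(str, Enum):
    DEVELOPMENT = "development"
    STAGING = "staging"
    PRODUCTION = "production"

# because of the str mixin, comparisons against plain strings still work:
# if dc_environment == DCEnvironment.STAGING: ...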

Member Author

I was thinking about the changes to this file slightly more widely, tbh. Just had to put the comment somewhere.

Member

Right. Yeah, not much else we can do apart from deploy and monitor, I don't think. I also don't think we get notified of failed run_command jobs... when you looked the other day, for example, I saw some that had failed.

@chris48s
Member Author

The other big worry I have here is

- location: serverscripts/rebuild_local_db.sh
timeout: 900
runas: every_election

When are these commands run? Are we running this on every deploy, including to production??
If it helps, the original commit doing all this was e48dd91

@chris48s chris48s requested a review from symroe July 30, 2024 14:41
@symroe
Member

symroe commented Jul 31, 2024

The other big worry I have here is

- location: serverscripts/rebuild_local_db.sh
timeout: 900
runas: every_election

When are these commands run? Are we running this on every deploy, including to production?? If it helps, the original commit doing all this was e48dd91

Thinking about this more, I expect this was all caused by us copying the WDIV CDK model. Specifically, in WDIV we use replication, and here we don't.

When using replication, you want to do something with the local DB before the instance serves traffic. When using a single remote postgres, you don't. I think I must have been confused between the two models.

@chris48s
Member Author

I must have been confused between the two models.

OK... so the upshot for this PR is: remove that line from appspec.yml then?

@coveralls

coveralls commented Jul 31, 2024

Coverage Status

coverage: 70.675% (+0.001%) from 70.674% when pulling 981ddb6 on crontab20243007 into d49959b on master.

@chris48s
Member Author

chris48s commented Jul 31, 2024

Right.

Just for future reference: this was kind of less bad than we first thought.
(For reasons) as well as having a shared RDS instance, the EE VMs also have a local Postgres install. The rebuild_local_db script was running against the local Postgres install, not the shared RDS.

So I removed that line from appspec.yml. My next thought was that this also makes

if dc_environment == "staging":
# Restore staging from prod backup
self.add_job(
"restore_staging_from_s3",
"cron(0 2 * * ? *)",
"output-on-error /var/www/every_election/repo/serverscripts/rebuild_local_db.sh",

wrong because we really want staging to run that against the staging RDS, not the local DB.

Then I realised there is no staging RDS (or at least no staging DB connection credentials). Probably no dev RDS either.
So maybe this is actually sort of right.
I think the next question is: do we think it is right that dev/staging are using a local DB rather than RDS?

@chris48s
Member Author

I reckon removing the command from appspec.yml is why the deploy has completely hung as well.

It's probably getting a 500 from the healthcheck endpoint or whatever, because the (local) DB is empty.

@chris48s
Member Author

chris48s commented Aug 8, 2024

We agreed that what we would do is:

  • Set EE staging to use RDS instead of local postgres
  • Set EE dev to use RDS instead of local postgres
  • Do a one-off import of data into those two DBs (and leave daily syncing of data from prod for another time)
  • Remove use of rebuild_local_db.sh in this PR completely

So here's a quick summary of what I've done, just because a bunch of stuff has happened outside of code/GitHub:

I created 2 DBs on the shared development RDS called ee_dev and ee_staging.

Then I restored a dump from prod into each of those 2 DBs.

Then I made 2 users called ee_dev and ee_staging.

Then I ran:

ee_dev=> GRANT CONNECT ON DATABASE ee_dev TO ee_dev;
ee_dev=> GRANT USAGE ON SCHEMA public TO ee_dev;
ee_dev=> GRANT CREATE ON SCHEMA public TO ee_dev;
ee_dev=> GRANT ALL PRIVILEGES ON DATABASE ee_dev TO ee_dev;
ee_dev=> GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO ee_dev;
ee_dev=> ALTER DATABASE ee_dev OWNER TO ee_dev;

..and the equivalent for the ee_staging user/DB.

(may not have actually happened in that exact order, but that's the most coherent way to present it 😄 )

One of the reasons I mention this is that I suspect it is not identical to the way the other DBs/users on this host are set up.

Then in AWS Parameter Store I set the

EE_DATABASE_HOST
EE_DATABASE_NAME
EE_DATABASE_USER
EE_DATABASE_PASSWORD

environment variables for each of the two environments (dev and staging). Dev is confirmed working, deployed with those settings. We haven't yet deployed to staging; it should deploy successfully (now that I've got the permissions right on dev) but is as yet untested. Separately, EE staging cannot currently be accessed due to a CloudFront configuration issue, but that is another issue for another time.
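
Setting one of these via the CLI looks roughly like the sketch below (the parameter name/path is an assumption; the repo may namespace these per environment differently, and the value shown is a placeholder):

# hypothetical sketch: store a DB connection setting in Parameter Store
aws ssm put-parameter \
  --name "EE_DATABASE_HOST" \
  --value "<dev-rds-endpoint>.eu-west-2.rds.amazonaws.com" \
  --type SecureString \
  --overwrite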

@chris48s
Member Author

I'm going to deploy this one first to make sure the staging RDS setup is fine. Then I will do #2198 once I've checked that.

@chris48s chris48s merged commit 4e0452b into master Aug 13, 2024
2 checks passed