Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create monthly database backup task #19

Open
3 tasks
danrademacher opened this issue Feb 20, 2019 · 11 comments
Open
3 tasks

Create monthly database backup task #19

danrademacher opened this issue Feb 20, 2019 · 11 comments
Assignees

Comments

@danrademacher
Copy link
Member

danrademacher commented Feb 20, 2019

Python script to do the following:

  • Make a copy of crashes_all_prod with some name like crashes_all_prod_achrive_YYYYMMDD
  • Delete previous month's backup to preserve disk space

Then

  • set on monthly schedule using Heroku Scheduler

┆Issue is synchronized with this Asana task

@danrademacher
Copy link
Member Author

@fahadkirmani for this task, take a look at the CARTO.com SWL API, https://carto.com/developers/sql-api/

You can see how we're doing the incremental updates in https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/blob/master/main.py and another set of tasks over at https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/tree/master/fixtallies

So for this we'd end up with a monthlybackup.py that would run using Heroku Scheduler. The main challenge here I think is going to be dealing with timeouts or limits on the SQL API given the size of the table. We're not sure yet if duplicating the table within CARTO will run into those or be exempt. Part of the work is to see if that's an issue and then figure out workarounds.

The CARTO credentials are in the env variables that you should be able to access in Heroku.

Let's use this issue for discussion and findings, and then we can pull in my lead dev once you've gotten to the point of having questions.

@fahadkirmani
Copy link
Collaborator

@danrademacher I have gone over the links you have mentioned and also configured the environment on my ubuntu 16.04 Xenial . When i tried running any query on carto https://carto.com/developers/sql-api/
I end up in error i think because of the CARTO_API_KEY which i think i need to get the data from it for comparing the data of SODA and CARTO sources. I have also installed the Heroku environment on the machine. I am also logged in Heroku. Also cloned the git python code of nyc-crash-mapper-etl-script.
Kindly i need some help in moving forward for running these scripts in my environment.

@danrademacher
Copy link
Member Author

You can find the CARTO API key under "config vars" here:
https://dashboard.heroku.com/apps/nyc-crash-mapper-etl/settings

@fahadkirmani
Copy link
Collaborator

I am trying this request https://{username}.carto.com/api/v2/sql?q={SQL statement}&api_key={api_key} to get the results but getting this message {"error":["Unauthorized"]}
Using the CARTO API KEY as you mentioned above and user name and table name from this script 2-update_carto.py
Are these credentials are okay?

@danrademacher
Copy link
Member Author

Those credentials should work since they are the ones running the daily script (which I just confirmed in Heroku logs is running fine), so let's have @gregallensworth take a look at this and see if he can get you going. Gregor, I have a new job to tie this time to, CHEKPEDS:New Dev Training

@gregallensworth
Copy link
Member

Working for me. I wrote a quick shell script which creates a URL to SELECT COUNT(*) FROM crashes_all_prod and confirmed that it displays the expected JSON code, and that if I supply an incorrect APIKEY that I do not get the expected JSON code.

#!/bin/sh

APIKEY="ABCXYZ"
USERNAME="chekpeds"
SQL="SELECT COUNT(*) FROM crashes_all_prod"

URL="https://$USERNAME.carto.com/api/v2/sql?api_key=$APIKEY&q=$SQL"

echo "Connecting..."
wget --quiet -O - "$URL"
echo ""
echo "Done"

The expected output is a row count (which will increase as more crashes are added each day):

{"rows":[{"count":1548595}],"time":0.597,"fields":{"count":{"type":"number"}},"total_rows":1}

@fahadkirmani If you run a simple SELECT query as I did, does it work or not? If it does, then the issue would be permissions for your given SQL statement. If it does not, then you may have copied the API key incorrectly.

@fahadkirmani
Copy link
Collaborator

@gregallensworth Yes i got successful response of the request
https://$USERNAME.carto.com/api/v2/sql?api_key=$APIKEY&q=$SQL

{"rows":[{"count":1548595}],"time":0.499,"fields":{"count":{"type":"number"}},"total_rows":1}
Thanks for help.

@danrademacher
Copy link
Member Author

@fahadkirmani have you been making progress here? Based on the last comment, it seemed like we resolved the API access issue. Have you been working on this since?

@fahadkirmani
Copy link
Collaborator

@danrademacher Yes i got success in it and i did this task https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/tree/master/fixtallies and successfully updated the carto data between this duration 2019-01-01 to 2019-03-28 . Also worked on it shell script so all this process can be run using shell script for my ease.
python 2-update_carto.py
Load crash_diffs.csv
4099198
{"rows":[],"time":0.791,"fields":{},"total_rows":1}
4098296
{"rows":[],"time":0.636,"fields":{},"total_rows":1}
4059929
{"rows":[],"time":0.626,"fields":{},"total_rows":1}
4097867
{"rows":[],"time":0.624,"fields":{},"total_rows":1}
4087063
{"rows":[],"time":0.62,"fields":{},"total_rows":1}
4070120
{"rows":[],"time":0.631,"fields":{},"total_rows":1}
4096730
{"rows":[],"time":0.631,"fields":{},"total_rows":1}
4096994
{"rows":[],"time":0.609,"fields":{},"total_rows":1}
4092290
{"rows":[],"time":0.61,"fields":{},"total_rows":1}
4071106
{"rows":[],"time":0.634,"fields":{},"total_rows":1}
4093707
{"rows":[],"time":0.624,"fields":{},"total_rows":1}
4096017
{"rows":[],"time":0.62,"fields":{},"total_rows":1}
4097447
{"rows":[],"time":0.652,"fields":{},"total_rows":1}
4098281

@gregallensworth gregallensworth removed their assignment Jun 20, 2019
@fahadkirmani
Copy link
Collaborator

fahadkirmani commented Aug 27, 2019

Hi @danrademacher ,

There two socrata_id(4181883,4025853) are not present in CARTO but they are there in SODA unique_key(4181883,4025853). I have updated the records but I think they(scripts I have) only update the data on CARTO from SODA but do not insert new records in CARTO.

09/10/2018 --> 4181883
04/18/2018 --> 4025853

image

https://postimg.cc/PL85vkmq

Can you check why some IDs are getting skipped as I don't see any information is missing for those two records in SODA . Also lon,lat information is present there in SODA for those two rows.

@gregallensworth
Copy link
Member

The leading reason that a crash would not have been loaded into CARTO, is that some crashes are not logged into Socrata for weeks or even months after the fact.

Original versions of the ETL script ran daily, and looked for crashes dated the prior day (e.g. today, it would query date >= '2019-08-26'). We didn't notice for quite some time, that this was missing crashes. The ETL script these days looks back two months as a balance between finding slow data entry and trying to digest too much data and crashing.

Crash 4181883 definitely fits into that profile. They waited 11 months to enter it, and we look back only 2 months.

"date": "2018-09-10T00:00:00.000",
":created_at": "2019-08-26T23:02:03.529Z"
":updated_at": "2019-08-26T23:02:44.636Z"

Crash 4025853 they waited 16 months:

"date": "2018-04-18T00:00:00.000",
":created_at": "2019-08-26T23:05:17.894Z",
":updated_at": "2019-08-26T23:05:58.222Z",

In order to address these, I wrote the backlog/ script, which you'll find in this same repository. This accepts a given year and month (e.g. "2018-09") and specifically scans Socrata for records in that month which are not in CARTO.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants