Create monthly database backup task #19

danrademacher · 2019-02-20T02:17:48Z

Python script to do the following:

Make a copy of crashes_all_prod with some name like crashes_all_prod_achrive_YYYYMMDD
Delete previous month's backup to preserve disk space

Then

set on monthly schedule using Heroku Scheduler

┆Issue is synchronized with this Asana task

The text was updated successfully, but these errors were encountered:

danrademacher · 2019-02-22T18:07:08Z

@fahadkirmani for this task, take a look at the CARTO.com SWL API, https://carto.com/developers/sql-api/

You can see how we're doing the incremental updates in https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/blob/master/main.py and another set of tasks over at https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/tree/master/fixtallies

So for this we'd end up with a monthlybackup.py that would run using Heroku Scheduler. The main challenge here I think is going to be dealing with timeouts or limits on the SQL API given the size of the table. We're not sure yet if duplicating the table within CARTO will run into those or be exempt. Part of the work is to see if that's an issue and then figure out workarounds.

The CARTO credentials are in the env variables that you should be able to access in Heroku.

Let's use this issue for discussion and findings, and then we can pull in my lead dev once you've gotten to the point of having questions.

fahadkirmani · 2019-02-24T16:41:51Z

@danrademacher I have gone over the links you have mentioned and also configured the environment on my ubuntu 16.04 Xenial . When i tried running any query on carto https://carto.com/developers/sql-api/
I end up in error i think because of the CARTO_API_KEY which i think i need to get the data from it for comparing the data of SODA and CARTO sources. I have also installed the Heroku environment on the machine. I am also logged in Heroku. Also cloned the git python code of nyc-crash-mapper-etl-script.
Kindly i need some help in moving forward for running these scripts in my environment.

danrademacher · 2019-02-25T22:20:42Z

You can find the CARTO API key under "config vars" here:
https://dashboard.heroku.com/apps/nyc-crash-mapper-etl/settings

fahadkirmani · 2019-02-27T13:47:44Z

I am trying this request https://{username}.carto.com/api/v2/sql?q={SQL statement}&api_key={api_key} to get the results but getting this message {"error":["Unauthorized"]}
Using the CARTO API KEY as you mentioned above and user name and table name from this script 2-update_carto.py
Are these credentials are okay?

danrademacher · 2019-02-27T13:58:39Z

Those credentials should work since they are the ones running the daily script (which I just confirmed in Heroku logs is running fine), so let's have @gregallensworth take a look at this and see if he can get you going. Gregor, I have a new job to tie this time to, CHEKPEDS:New Dev Training

gregallensworth · 2019-02-27T20:04:08Z

Working for me. I wrote a quick shell script which creates a URL to SELECT COUNT(*) FROM crashes_all_prod and confirmed that it displays the expected JSON code, and that if I supply an incorrect APIKEY that I do not get the expected JSON code.

#!/bin/sh

APIKEY="ABCXYZ"
USERNAME="chekpeds"
SQL="SELECT COUNT(*) FROM crashes_all_prod"

URL="https://$USERNAME.carto.com/api/v2/sql?api_key=$APIKEY&q=$SQL"

echo "Connecting..."
wget --quiet -O - "$URL"
echo ""
echo "Done"

The expected output is a row count (which will increase as more crashes are added each day):

{"rows":[{"count":1548595}],"time":0.597,"fields":{"count":{"type":"number"}},"total_rows":1}

@fahadkirmani If you run a simple SELECT query as I did, does it work or not? If it does, then the issue would be permissions for your given SQL statement. If it does not, then you may have copied the API key incorrectly.

fahadkirmani · 2019-02-28T07:58:04Z

@gregallensworth Yes i got successful response of the request
https://$USERNAME.carto.com/api/v2/sql?api_key=$APIKEY&q=$SQL

{"rows":[{"count":1548595}],"time":0.499,"fields":{"count":{"type":"number"}},"total_rows":1}
Thanks for help.

danrademacher · 2019-03-29T01:59:51Z

@fahadkirmani have you been making progress here? Based on the last comment, it seemed like we resolved the API access issue. Have you been working on this since?

fahadkirmani · 2019-03-30T15:44:12Z

@danrademacher Yes i got success in it and i did this task https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/tree/master/fixtallies and successfully updated the carto data between this duration 2019-01-01 to 2019-03-28 . Also worked on it shell script so all this process can be run using shell script for my ease.
python 2-update_carto.py
Load crash_diffs.csv
4099198
{"rows":[],"time":0.791,"fields":{},"total_rows":1}
4098296
{"rows":[],"time":0.636,"fields":{},"total_rows":1}
4059929
{"rows":[],"time":0.626,"fields":{},"total_rows":1}
4097867
{"rows":[],"time":0.624,"fields":{},"total_rows":1}
4087063
{"rows":[],"time":0.62,"fields":{},"total_rows":1}
4070120
{"rows":[],"time":0.631,"fields":{},"total_rows":1}
4096730
{"rows":[],"time":0.631,"fields":{},"total_rows":1}
4096994
{"rows":[],"time":0.609,"fields":{},"total_rows":1}
4092290
{"rows":[],"time":0.61,"fields":{},"total_rows":1}
4071106
{"rows":[],"time":0.634,"fields":{},"total_rows":1}
4093707
{"rows":[],"time":0.624,"fields":{},"total_rows":1}
4096017
{"rows":[],"time":0.62,"fields":{},"total_rows":1}
4097447
{"rows":[],"time":0.652,"fields":{},"total_rows":1}
4098281

fahadkirmani · 2019-08-27T16:26:57Z

Hi @danrademacher ,

There two socrata_id(4181883,4025853) are not present in CARTO but they are there in SODA unique_key(4181883,4025853). I have updated the records but I think they(scripts I have) only update the data on CARTO from SODA but do not insert new records in CARTO.

09/10/2018 --> 4181883
04/18/2018 --> 4025853

https://postimg.cc/PL85vkmq

Can you check why some IDs are getting skipped as I don't see any information is missing for those two records in SODA . Also lon,lat information is present there in SODA for those two rows.

gregallensworth · 2019-08-27T20:27:11Z

The leading reason that a crash would not have been loaded into CARTO, is that some crashes are not logged into Socrata for weeks or even months after the fact.

Original versions of the ETL script ran daily, and looked for crashes dated the prior day (e.g. today, it would query date >= '2019-08-26'). We didn't notice for quite some time, that this was missing crashes. The ETL script these days looks back two months as a balance between finding slow data entry and trying to digest too much data and crashing.

Crash 4181883 definitely fits into that profile. They waited 11 months to enter it, and we look back only 2 months.

"date": "2018-09-10T00:00:00.000",
":created_at": "2019-08-26T23:02:03.529Z"
":updated_at": "2019-08-26T23:02:44.636Z"

Crash 4025853 they waited 16 months:

"date": "2018-04-18T00:00:00.000",
":created_at": "2019-08-26T23:05:17.894Z",
":updated_at": "2019-08-26T23:05:58.222Z",

In order to address these, I wrote the backlog/ script, which you'll find in this same repository. This accepts a given year and month (e.g. "2018-09") and specifically scans Socrata for records in that month which are not in CARTO.

danrademacher assigned fahadkirmani Feb 22, 2019

danrademacher assigned gregallensworth Feb 25, 2019

danrademacher unassigned gregallensworth Feb 25, 2019

danrademacher assigned gregallensworth Feb 27, 2019

gregallensworth removed their assignment Jun 20, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Create monthly database backup task #19

Create monthly database backup task #19

danrademacher commented Feb 20, 2019 •

edited by sync-by-unito bot

Loading

danrademacher commented Feb 22, 2019

fahadkirmani commented Feb 24, 2019

danrademacher commented Feb 25, 2019

fahadkirmani commented Feb 27, 2019

danrademacher commented Feb 27, 2019

gregallensworth commented Feb 27, 2019

fahadkirmani commented Feb 28, 2019

danrademacher commented Mar 29, 2019

fahadkirmani commented Mar 30, 2019

fahadkirmani commented Aug 27, 2019 •

edited

Loading

gregallensworth commented Aug 27, 2019

Create monthly database backup task #19

Create monthly database backup task #19

Comments

danrademacher commented Feb 20, 2019 • edited by sync-by-unito bot Loading

danrademacher commented Feb 22, 2019

fahadkirmani commented Feb 24, 2019

danrademacher commented Feb 25, 2019

fahadkirmani commented Feb 27, 2019

danrademacher commented Feb 27, 2019

gregallensworth commented Feb 27, 2019

fahadkirmani commented Feb 28, 2019

danrademacher commented Mar 29, 2019

fahadkirmani commented Mar 30, 2019

fahadkirmani commented Aug 27, 2019 • edited Loading

gregallensworth commented Aug 27, 2019

danrademacher commented Feb 20, 2019 •

edited by sync-by-unito bot

Loading

fahadkirmani commented Aug 27, 2019 •

edited

Loading