Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adds batch jobs for long-running geometry creation #211

Merged
merged 13 commits into from Sep 22, 2017

Conversation

andy-esch
Copy link
Contributor

@andy-esch andy-esch commented Sep 12, 2017

For larger tables written to carto with the lnglat flag, the geometry creation step times out. To get around this, we can send the job to the SQL Batch API. The upside is that there will not be a timeout, the downside is that it has to sit in a queue.

The new flow for cc.write is that if the dataframe has more than 1e5 rows, the geometry creation query will be sent to the Batch SQL API and a BatchJobStatus object will be returned. This object can be polled with the .status() method. If the number of rows is at or below 1e5, the normal SQL API will be used for geometry creation and cc.write will return None.

Example usage:

import time
job = cc.write(df, 'new_table',
               lnglat=('lng_col', 'lat_col'))
while job is not None:
    curr_status = job.status()['status']
    if curr_status in ('done', 'failed', 'canceled', 'unknown', ):
        print(curr_status)
        break
    time.sleep(5)

ToDos

  • turn the pandas.groupby into a generator
  • ensure that there's test coverage to all the new pieces
    • BatchJobStatus
    • lnglat for batch jobs

@coveralls
Copy link

coveralls commented Sep 12, 2017

Coverage Status

Coverage decreased (-1.05%) to 89.474% when pulling 6b9a1a8 on batch-lnglat-update into 17d71bb on master.

@coveralls
Copy link

coveralls commented Sep 12, 2017

Coverage Status

Coverage increased (+0.06%) to 90.582% when pulling ecc8cd0 on batch-lnglat-update into 17d71bb on master.

@coveralls
Copy link

coveralls commented Sep 12, 2017

Coverage Status

Coverage increased (+0.2%) to 90.72% when pulling fb81885 on batch-lnglat-update into 17d71bb on master.

2 similar comments
@coveralls
Copy link

Coverage Status

Coverage increased (+0.2%) to 90.72% when pulling fb81885 on batch-lnglat-update into 17d71bb on master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.2%) to 90.72% when pulling fb81885 on batch-lnglat-update into 17d71bb on master.

@andy-esch
Copy link
Contributor Author

@dmed256, could you take a look at this?

@CartoDB CartoDB deleted a comment from coveralls Sep 12, 2017
'''.format(table_name=final_table_name,
lng=lnglat[0],
lat=lnglat[1])
if df.shape[0] > 100000:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be:
if df.shape[0] > MAX_IMPORT_ROWS:

batch_client = BatchSQLClient(self.auth_client)
status = batch_client.create([query, ])
tqdm.write('Table successfully written to CARTO: '
'{base_url}dataset/{table_name} . `the_geom` '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make sure the / are there:

os.path.join(base_url, 'dataset', table_name)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's ok the way you have it, it's in other places in the code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well be careful

@@ -189,6 +195,16 @@ def test_cartocontext_write(self):
self.assertEqual(resp['rows'][0]['num_rows'],
resp['rows'][0]['num_geoms'])

# test batch lnglat behavior
n_rows = 100001
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n_rows = context.MAX_IMPORT_ROWS + 1

'last_status=\'{status}\')'.format(job_id=self.job_id,
status=self.last_status))

def status(self):
Copy link
Contributor Author

@andy-esch andy-esch Sep 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dmed256 you suggested in #85 a method for returning dataframes from longer running queries. Here we could have a flag or another method for returning the lnglat'd version of the table/dataframe. Soon, I envision us adding more than the lnglat flag for adding geometries.. e.g., street-level geocoding, routing, etc.

But here, we could have a simple new method (or flag to status), which indicates that when status=done, do cc.read('tablename') to return the carto'd dataframe.

What do you think? Too soon?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think too soon. I want to scope it out more before including anything else here.

@coveralls
Copy link

coveralls commented Sep 12, 2017

Coverage Status

Coverage increased (+0.2%) to 90.733% when pulling 75ecd0b on batch-lnglat-update into 17d71bb on master.

@coveralls
Copy link

coveralls commented Sep 13, 2017

Coverage Status

Coverage increased (+0.2%) to 90.733% when pulling fa2d5b6 on batch-lnglat-update into 17d71bb on master.

1 similar comment
@coveralls
Copy link

coveralls commented Sep 13, 2017

Coverage Status

Coverage increased (+0.2%) to 90.733% when pulling fa2d5b6 on batch-lnglat-update into 17d71bb on master.

@andy-esch
Copy link
Contributor Author

To avoid the issues discussed in #153 and #210, I've done some refactoring to investigate more efficient ways of getting the tables to (1) display in the dashboard and get a dataset page without being a ghost table, and (2) avoid timeouts by finding cleverer ways of populating the geometry field by avoiding UPDATE statements and offloading the queries to the Batch SQL API.

Some timings without the Batch SQL API for the entire write process:
screen shot 2017-09-14 at 12 51 44

@CartoDB CartoDB deleted a comment from coveralls Sep 21, 2017
@CartoDB CartoDB deleted a comment from coveralls Sep 21, 2017
@CartoDB CartoDB deleted a comment from coveralls Sep 21, 2017
@CartoDB CartoDB deleted a comment from coveralls Sep 21, 2017
@CartoDB CartoDB deleted a comment from coveralls Sep 21, 2017
@coveralls
Copy link

coveralls commented Sep 21, 2017

Coverage Status

Changes Unknown when pulling 33171e0 on batch-lnglat-update into ** on master**.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants