adds batch jobs for long-running geometry creation #211
@dmed256, could you take a look at this?
cartoframes/context.py
Outdated
'''.format(table_name=final_table_name,
           lng=lnglat[0],
           lat=lnglat[1])
if df.shape[0] > 100000:
I think it should be:
if df.shape[0] > MAX_IMPORT_ROWS:
cartoframes/context.py
Outdated
batch_client = BatchSQLClient(self.auth_client)
status = batch_client.create([query, ])
tqdm.write('Table successfully written to CARTO: '
           '{base_url}dataset/{table_name} . `the_geom` '
To make sure the `/` are there: `os.path.join(base_url, 'dataset', table_name)`
Maybe it's ok the way you have it, it's in other places in the code
Might as well be careful
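A minimal sketch of the slash concern, assuming a hypothetical helper named `join_url` (not part of cartoframes) and a placeholder base URL. Note that `os.path.join` uses the operating system's path separator, so on Windows it would insert backslashes; an explicit `'/'.join` avoids that:

```python
def join_url(base_url, *parts):
    """Join URL pieces with exactly one '/' between them.

    Hypothetical helper for illustration only; not part of cartoframes.
    """
    pieces = [base_url.rstrip('/')] + [str(p).strip('/') for p in parts]
    return '/'.join(pieces)


# Placeholder base URLs, with and without a trailing slash.
base_with_slash = 'https://example.carto.com/'
base_without_slash = 'https://example.carto.com'

with_slash = join_url(base_with_slash, 'dataset', 'my_table')
without_slash = join_url(base_without_slash, 'dataset', 'my_table')

# The plain format string from the diff relies on base_url ending in '/'.
naive = '{base_url}dataset/{table_name}'.format(
    base_url=base_without_slash, table_name='my_table')
```

Both `join_url` calls give the same URL, while the plain format string drops the separator when `base_url` has no trailing slash.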
test/test_context.py
Outdated
@@ -189,6 +195,16 @@ def test_cartocontext_write(self):
self.assertEqual(resp['rows'][0]['num_rows'],
                 resp['rows'][0]['num_geoms'])

# test batch lnglat behavior
n_rows = 100001
n_rows = context.MAX_IMPORT_ROWS + 1
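A rough sketch of why the named constant helps in the test, assuming `MAX_IMPORT_ROWS` is 100000 (the literal in the diff) and using a hypothetical predicate `uses_batch_api` in place of the real check inside `cc.write`:

```python
# Assumed value, copied from the literal 100000 in the diff above.
MAX_IMPORT_ROWS = 100000


def uses_batch_api(n_rows, max_rows=MAX_IMPORT_ROWS):
    """Hypothetical stand-in for the `df.shape[0] > MAX_IMPORT_ROWS` check."""
    return n_rows > max_rows


# Boundary cases expressed relative to the constant, so the test keeps
# exercising the threshold even if the constant's value changes later.
at_limit = uses_batch_api(MAX_IMPORT_ROWS)
over_limit = uses_batch_api(MAX_IMPORT_ROWS + 1)
```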
cartoframes/context.py
Outdated
'last_status=\'{status}\')'.format(job_id=self.job_id,
                                   status=self.last_status))

def status(self):
@dmed256 you suggested in #85 a method for returning dataframes from longer running queries. Here we could have a flag or another method for returning the lnglat'd version of the table/dataframe. Soon, I envision us adding more than the lnglat flag for adding geometries, e.g., street-level geocoding, routing, etc.
But here, we could have a simple new method (or flag to status), which indicates that when `status=done`, do `cc.read('tablename')` to return the carto'd dataframe.
What do you think? Too soon?
I think too soon. I want to scope it out more before including anything else here.
To avoid the issues discussed in #153 and #210, I've done some refactoring to investigate more efficient ways of getting the tables to (1) display in the dashboard and get a dataset page without being a ghost table, and (2) avoid timeouts by finding cleverer ways of populating the geometry field by avoiding
Some timings without the Batch SQL API for the entire write process:
Changes Unknown when pulling 33171e0 on batch-lnglat-update into master.
For larger tables written to carto with the `lnglat` flag, the geometry creation step times out. To get around this, we can send the job to the SQL Batch API. The upside is that there will not be a timeout; the downside is that it has to sit in a queue.

The new flow for `cc.write` is that if the dataframe has more than 1e5 rows, the geometry creation query will be sent to the Batch SQL API and a `BatchJobStatus` object will be returned. This object can be polled with the `.status()` method. If the number of rows is at or below 1e5, the normal SQL API will be used for geometry creation and `cc.write` will return `None`.

Example usage:
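A minimal sketch of this flow, using stand-ins rather than the real cartoframes objects. `FakeBatchJobStatus`, `fake_write`, and `wait_until_done` are hypothetical names, and the dict returned by `status()` and the `'pending'` state are assumptions. Only the 1e5 threshold, the `None` return value, the `.status()` method, and the `done` state come from the discussion here:

```python
import time

MAX_IMPORT_ROWS = 100000  # assumed threshold: 1e5 rows, per the description


class FakeBatchJobStatus(object):
    """Stand-in for BatchJobStatus; not the real cartoframes class."""

    def __init__(self, polls_until_done=2):
        self._remaining = polls_until_done

    def status(self):
        # Each poll moves the fake job one step closer to completion.
        # The dict shape and 'pending' value are assumptions.
        if self._remaining > 0:
            self._remaining -= 1
            return {'status': 'pending'}
        return {'status': 'done'}


def fake_write(n_rows, lnglat=None):
    """Mimic the described return values of cc.write (illustrative only)."""
    if lnglat is not None and n_rows > MAX_IMPORT_ROWS:
        return FakeBatchJobStatus()
    return None


def wait_until_done(job, interval=0.0, max_polls=10):
    """Poll job.status() until it reports 'done' or max_polls is reached."""
    for _ in range(max_polls):
        if job.status()['status'] == 'done':
            return True
        time.sleep(interval)
    return False


small = fake_write(MAX_IMPORT_ROWS, lnglat=('lng', 'lat'))      # normal SQL API path
large = fake_write(MAX_IMPORT_ROWS + 1, lnglat=('lng', 'lat'))  # batch path
finished = wait_until_done(large)
```

With the real library, the equivalent would be to keep the object returned by `cc.write` and call its `.status()` method until the job completes; a real polling loop would likely want a non-zero interval.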
ToDos
pandas.groupby into a generator
BatchJobStatus