COPY TO for data obs queries - Part 2 use fetch #592

Merged · 15 commits · Apr 11, 2019

Conversation

@alrocar (Contributor) commented Apr 5, 2019

Review after #591 has been merged

Closes #565

@alrocar alrocar changed the base branch from master to 565_data_obs April 5, 2019 11:14
@alrocar alrocar changed the title Use fetch in data observatory queries COPY TO for data obs queries - Part 2 use fetch Apr 5, 2019
@andy-esch (Contributor) left a comment

Looks great! Just a few minor things we need to figure out.

warn(
'{s0} was augmented as {s1} because of name '
'collision'.format(s0=suggested, s1=names[suggested])
)
else:
names[suggested] = suggested

# drop description columns to lighten the query
Contributor:

❤️
It's also worth exploring the minimum set of metadata we need to send. My guess is that only half of the metadata columns are needed to uniquely define the measure.

@@ -334,6 +344,8 @@ def clean_dataframe_from_carto(df, table_columns, decode_geom=False):
for column_name in table_columns:
if table_columns[column_name]['type'] == 'date':
df[column_name] = pd.to_datetime(df[column_name], errors='ignore')
elif table_columns[column_name]['type'] == 'boolean':
Contributor:

Good catch, I wasn't aware that bools weren't properly converted. Maybe the second bullet item in the docstring should be expanded to mention that date and boolean columns are processed.

columns = get_columns(self.cc, query).keys()

if exclude and isinstance(exclude, list):
columns = list(set(columns) - set(exclude))
Contributor:

We should ensure that exclude is a list or tuple and not a str because:

>>> exclude = 'the_geom_webmercator'
>>> set(exclude)
{'_', 'a', 'b', 'c', 'e', 'g', 'h', 'm', 'o', 'r', 't', 'w'}
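A defensive normalization could guard against that pitfall; a sketch (the helper is hypothetical, not part of cartoframes):

```python
def normalize_exclude(exclude):
    # set('the_geom_webmercator') would explode a bare string into
    # individual characters, so wrap strings in a one-element set first.
    if exclude is None:
        return set()
    if isinstance(exclude, str):
        return {exclude}
    return set(exclude)

columns = ['the_geom', 'the_geom_webmercator', 'pop_2015']
kept = [c for c in columns if c not in normalize_exclude('the_geom_webmercator')]
```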

Contributor (author):

Yep, we're already checking that it's an instance of list. Anyway, we'll add proper docs once all these new methods are consolidated.

Contributor:

Ha, I missed the isinstance(exclude, list) above 🤦‍♂️ Sorry for the noise!

Contributor (author):

np 😅

)['fields'].keys()
# get column names except the_geom_webmercator
dataset = Dataset(self, table_name)
table_columns = dataset.get_table_column_names(exclude=['the_geom_webmercator'])
Contributor:

FYI, the DEFAULT_SQL_ARGS are how cartoframes calls show up in the logs so we can monitor performance, etc. They're set up so that each method gets registered once despite making multiple calls.

Contributor (author):

Hmm, is this still true? I see DEFAULT_SQL_ARGS only having a do_post attribute.

Contributor:

Good question. Are POST request args searchable in kibana?


Contributor:

The POST arguments are not saved by default. The only things we can check in Kibana about POST requests to the SQL API are the queries.

Contributor:

Two options:

  • Use the User-Agent header, which is designed for exactly this (and it's how we do it in carto-python)
  • Add a SQL comment to the queries (seems hacky)

I'd go with the User-Agent header: it should not be hard to do, and it makes the calls searchable everywhere: Kibana, Rollbar, etc.
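A sketch of what tagging requests with a custom User-Agent could look like, using only the standard library (the URL shape and version string are assumptions, not the actual cartoframes implementation):

```python
from urllib.parse import urlencode
from urllib.request import Request

USER_AGENT = 'cartoframes/1.0.0'  # hypothetical client version string

def sql_api_request(base_url, query, api_key):
    # Build a POST to the SQL API tagged with a custom User-Agent,
    # so the calls become searchable in Kibana, Rollbar, etc.
    data = urlencode({'q': query, 'api_key': api_key}).encode()
    return Request(
        base_url + '/api/v2/sql',
        data=data,
        headers={'User-Agent': USER_AGENT},
    )

req = sql_api_request('https://user.carto.com', 'SELECT 1', 'key')
```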

Contributor:

In case it ends up being solved with a User-Agent header, I created issue #601 to handle it.

Contributor (author):

I added the user-agent thing here

Contributor:

Also (CartoDB/carto-python#111).

Nice, well solved 👍

Contributor:

👏

@alrocar alrocar requested a review from oleurud April 8, 2019 08:49
@@ -1099,8 +1099,7 @@ def data_boundaries(self, boundary=None, region=None, decode_geom=False,
boundaries in `region` (or the world if `region` is ``None``)
"""
# TODO: create a function out of this?
-        if (isinstance(region, collections.Iterable)
-                and not isinstance(region, str)):
+        if (isinstance(region, collections.Iterable) and not isinstance(region, str)):

Contributor:

Maybe check isinstance(region, str) first, so we don't need the negative check? e.g.:

if isinstance(region, str):
    ...
elif isinstance(region, collections.Iterable):
    ...
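That reordering could be sketched as a small helper (the function name is hypothetical, and it uses collections.abc as on modern Python):

```python
from collections.abc import Iterable

def normalize_region(region):
    # Check str first: the positive test reads better than the negated
    # isinstance check, and str is itself iterable so order matters.
    if isinstance(region, str):
        return [region]
    elif isinstance(region, Iterable):
        return list(region)
    raise ValueError('region must be a string or an iterable of strings')
```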

@@ -1294,7 +1292,7 @@ def data_discovery(self, region, keywords=None, regex=None, time=None,
except ValueError:
# TODO: make this work for general queries
# see if it's a table
-            self.sql_client.send(
+            self.batch_sql_client.create_and_wait_for_completion(

Contributor:

Does batch SQL work with EXPLAIN?

Contributor (author):

Mm, the point is that the result of the sql_client call wasn't being stored anywhere, so I assume the query is just there to check whether the table exists; the batch_sql_client works for that purpose as well.

Contributor:

It seems overkill to use the batch API (which might take a while to run due to scheduling). I'd use the normal SQL API if the change is easy.


result = self.fetch(query)
if persist_as:
self.write(result, persist_as, overwrite=True)

Contributor:

Maybe we should be smarter: if you need to store the result, just do a Batch API SELECT INTO and then download it (or not even download it)?
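The persist path suggested above could be sketched as wrapping the query for the Batch SQL API to run (the helper and the quoting are assumptions, not the actual implementation):

```python
def select_into(query, table_name):
    # Materialize the query's result server-side as a new table instead
    # of streaming rows back to the client; the Batch SQL API would then
    # execute this statement asynchronously.
    return 'SELECT * INTO "{table}" FROM ({query}) AS _subquery'.format(
        table=table_name, query=query)
```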

Contributor (author):

Makes sense, I've added a Dataset.from_query method that creates a table from the query. Then we use Dataset.download to get the result.

@oleurud (Contributor) left a comment

Blessings

@alrocar alrocar changed the base branch from 565_data_obs to master April 11, 2019 21:09
@alrocar alrocar merged commit 7c039a3 into master Apr 11, 2019
@Jesus89 Jesus89 deleted the 565_data_obs_part_2 branch September 30, 2019 16:48