-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
COPY TO for data obs queries - Part 2 use fetch #592
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Just a few minor things that we need to figure out more.
warn( | ||
'{s0} was augmented as {s1} because of name ' | ||
'collision'.format(s0=suggested, s1=names[suggested]) | ||
) | ||
else: | ||
names[suggested] = suggested | ||
|
||
# drop description columns to lighten the query |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
It's also worth exploring what the minimum set of metadata is we need to send. My guess is that only half of the metadata columns are needed to uniquely define the measure.
@@ -334,6 +344,8 @@ def clean_dataframe_from_carto(df, table_columns, decode_geom=False): | |||
for column_name in table_columns: | |||
if table_columns[column_name]['type'] == 'date': | |||
df[column_name] = pd.to_datetime(df[column_name], errors='ignore') | |||
elif table_columns[column_name]['type'] == 'boolean': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I wasn't aware that bools weren't properly converted. Maybe the second bullet item in the docstring should be expanded to include process date and boolean columns
columns = get_columns(self.cc, query).keys() | ||
|
||
if exclude and isinstance(exclude, list): | ||
columns = list(set(columns) - set(exclude)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should ensure that exclude
is a list or tuple and not a str because:
>>> exclude = 'the_geom_webmercator'
>>> set(exclude)
{'_', 'a', 'b', 'c', 'e', 'g', 'h', 'm', 'o', 'r', 't', 'w'}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, we are already checking that it is an instance of list
. Anyway we'll add proper docs once all these new methods are consolidated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha, I missed the isinstance(exclude, list)
above 🤦♂️ Sorry for the noise!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np 😅
)['fields'].keys() | ||
# get column names except the_geom_webmercator | ||
dataset = Dataset(self, table_name) | ||
table_columns = dataset.get_table_column_names(exclude=['the_geom_webmercator']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI, the DEFAULT_SQL_ARGS
are the way to get the cartoframes calls to show up in logs so we can monitor performance, etc. They're fit in so that each method get registered once despite having multiple calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mmm is this still true? I see DEFAULT_SQL_ARGS just having a do_post
attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. Are POST request args searchable in kibana?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oleurud ^^
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The POST arguments are not saved by default. The only things we can check in Kibana about a POST request to SQL API are the queries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two options:
- Use the user agent header, which is designed to do exactly this (and it's how we do it in carto-python)
- Add a SQL comment to the queries (seems hacky)
I'd do the user agent thing, which should not be hard to do and it makes it searchable everywhere: Kibana, Rollbar, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just in case it will be solved with a user agent header, I had created this issue the handle it #601
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the user-agent thing here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also (CartoDB/carto-python#111).
Nice, well solved 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👏
…into 565_data_obs_part_2
cartoframes/context.py
Outdated
@@ -1099,8 +1099,7 @@ def data_boundaries(self, boundary=None, region=None, decode_geom=False, | |||
boundaries in `region` (or the world if `region` is ``None``) | |||
""" | |||
# TODO: create a function out of this? | |||
if (isinstance(region, collections.Iterable) | |||
and not isinstance(region, str)): | |||
if (isinstance(region, collections.Iterable) and not isinstance(region, str)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe check the isintance(str)
first, so we don't have to do the negative check? e.g:
if isinstance(str)
...
elif isinstance(Iterable)
...
cartoframes/context.py
Outdated
@@ -1294,7 +1292,7 @@ def data_discovery(self, region, keywords=None, regex=None, time=None, | |||
except ValueError: | |||
# TODO: make this work for general queries | |||
# see if it's a table | |||
self.sql_client.send( | |||
self.batch_sql_client.create_and_wait_for_completion( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does batch SQL work with EXPLAIN
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mm the point is that the result of the sql_client
was not being stored anywhere so I assume the query is just to check if the table exists, so the batch_sql_client
works for that purpose as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems overkill to use the batch API (which might take a while to run due to scheduling). I'd use the normal SQL API if the change is easy.
cartoframes/context.py
Outdated
|
||
result = self.fetch(query) | ||
if persist_as: | ||
self.write(result, persist_as, overwrite=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should be smarter and if you need to store it, just do a Batch API SELECT INTO
and then download the result (or not even download the result)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I've added a Dataset.from_query
method that creates a table from the query. Then we use Dataset.download
to get the result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blessings
Review after #591 has been merged
Closes #565