New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upload table using to_carto in chunks #1676
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
I just add some comments about missing docs and perf improvements.
cartoframes/io/carto.py
Outdated
@@ -93,6 +96,11 @@ def to_carto(dataframe, table_name, credentials=None, if_exists='fail', geom_col | |||
ValueError: if the dataframe or table name provided are wrong or the if_exists param is not valid. | |||
|
|||
""" | |||
def estimate_csv_size(gdf): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would move estimate_csv_size
to utils.utils
, also compute_copy_data
because we are using it in different places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't moved it to utils.utils
because that would mean that utils.utils
would depend on io.managers.context_manager
and as that file already depends on utils.utils
that would create a circular dependency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I'll move it to utils.columns
then or at least out of the to_carto
function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The utils.columns
file has the same problem with the circular dependencies.
I have tried extracting the method and all of it's dependencies to utils.utils
or utils.columns
but in that case I find circular dependencies between utils.utils
and utils.columns
Finally I have extracted the estimate_csv_size
method to the carto.py
file but outside the to_carto
method
cartoframes/io/carto.py
Outdated
def estimate_csv_size(gdf): | ||
n = min(100, len(gdf)) | ||
return len(''.join([x.decode("utf-8") for x in | ||
_compute_copy_data(gdf.sample(n=n), get_dataframe_columns_info(gdf))])) * len(gdf) / n |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would extract this get_dataframe_columns_info(gdf)
to a variable columns_info before the loop.
Also, we could try something like sum([len(x) for x in ...])
to avoid the decode and join so it will be a bit faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch, on it!
cartoframes/io/carto.py
Outdated
@@ -93,6 +96,11 @@ def to_carto(dataframe, table_name, credentials=None, if_exists='fail', geom_col | |||
ValueError: if the dataframe or table name provided are wrong or the if_exists param is not valid. | |||
|
|||
""" | |||
def estimate_csv_size(gdf): | |||
n = min(100, len(gdf)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would update 100 when we have the information of which value is enough for the precision we want for the estimation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but spoiler, that looks like a good number
cartoframes/io/carto.py
Outdated
@@ -130,7 +138,14 @@ def to_carto(dataframe, table_name, credentials=None, if_exists='fail', geom_col | |||
elif isinstance(dataframe, GeoDataFrame): | |||
log.warning('Geometry column not found in the GeoDataFrame.') | |||
|
|||
table_name = context_manager.copy_from(gdf, table_name, if_exists, cartodbfy) | |||
chunk_count = int(math.ceil(estimate_csv_size(gdf) / max_upload_size)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove the int(
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally!
@@ -21,6 +21,27 @@ | |||
DEFAULT_RETRY_TIMES = 3 | |||
|
|||
|
|||
def retry_copy(func): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice decorator!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙇
tests/unit/io/test_carto.py
Outdated
norm_table_name = to_carto(gdf, table_name, CREDENTIALS, max_upload_size=100000) | ||
|
||
# Then | ||
assert cm_mock.call_count == 12 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that there are 12 chunks? Could you add a note with the info about where this comes from (max_size and size)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added note 👍
@Jesus89 second CR round? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
to_carto
in chunks. The estimation for the file size is performed extracting a sample of 100 rows of the dataframe and computing the actual size of the exported file using the same mechanism of the actual export to CSV (WKB and so on)_copy_to
and_copy_from
functions