Upload table using to_carto in chunks #1676

Merged
13 commits merged from chore/ch93723/optimize-upload-to-carto into develop on Aug 24, 2020

Conversation

@antoniocarlon (Contributor) commented Aug 21, 2020

  • Upload the table in chunks when using to_carto. The file size is estimated by extracting a sample of 100 rows from the dataframe and computing the size of the exported file using the same mechanism as the actual CSV export (WKB encoding and so on); a sketch of the overall flow follows below.
  • Added a retry decorator for the _copy_to and _copy_from functions.
  • Added tests for the new functionality.
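
A minimal sketch of how these pieces could fit together (not the actual cartoframes implementation: upload_in_chunks, the np.array_split splitting, and the append behavior for later chunks are assumptions made for illustration; estimate_csv_size, max_upload_size and context_manager.copy_from appear in the diffs below):

import math

import numpy as np


def upload_in_chunks(gdf, table_name, context_manager, if_exists, cartodbfy, max_upload_size):
    # Estimate the exported CSV size from a small sample and derive the chunk count.
    chunk_count = max(1, math.ceil(estimate_csv_size(gdf) / max_upload_size))

    # Split the dataframe into roughly equal row chunks and upload them one by one.
    for i, chunk in enumerate(np.array_split(gdf, chunk_count)):
        # The first chunk honors if_exists; later chunks are appended to the same table.
        mode = if_exists if i == 0 else 'append'
        table_name = context_manager.copy_from(chunk, table_name, mode, cartodbfy)

    return table_name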

@Jesus89 (Member) left a comment

Looks great!

I just added some comments about missing docs and perf improvements.

@@ -93,6 +96,11 @@ def to_carto(dataframe, table_name, credentials=None, if_exists='fail', geom_col
        ValueError: if the dataframe or table name provided are wrong or the if_exists param is not valid.

    """
    def estimate_csv_size(gdf):
Member

I would move estimate_csv_size to utils.utils, and also compute_copy_data, since we are using it in different places.

Contributor Author

I haven't moved it to utils.utils because that would mean that utils.utils would depend on io.managers.context_manager, and since that file already depends on utils.utils, that would create a circular dependency.

Member

Oh, I'll move it to utils.columns then, or at least out of the to_carto function.

Contributor Author

The utils.columns file has the same problem with circular dependencies.

I have tried extracting the method and all of its dependencies to utils.utils or utils.columns, but in that case I find circular dependencies between utils.utils and utils.columns.

Finally, I have extracted the estimate_csv_size method to the carto.py file, but outside the to_carto method.

cartoframes/io/carto.py
def estimate_csv_size(gdf):
    n = min(100, len(gdf))
    return len(''.join([x.decode("utf-8") for x in
                        _compute_copy_data(gdf.sample(n=n), get_dataframe_columns_info(gdf))])) * len(gdf) / n
Member

I would extract get_dataframe_columns_info(gdf) into a variable columns_info before the loop.

Also, we could try something like sum([len(x) for x in ...]) to avoid the decode and join, so it will be a bit faster.
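
A sketch of the suggested change (the final code in the PR may differ slightly):

def estimate_csv_size(gdf):
    n = min(100, len(gdf))
    columns_info = get_dataframe_columns_info(gdf)  # computed once instead of inside the expression
    # Summing raw byte lengths avoids decoding and joining the chunks just to measure them.
    return sum(len(x) for x in _compute_copy_data(gdf.sample(n=n), columns_info)) * len(gdf) / n

Measuring byte length directly is also closer to the actual upload size than counting decoded characters.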

Contributor Author

nice catch, on it!

@@ -93,6 +96,11 @@ def to_carto(dataframe, table_name, credentials=None, if_exists='fail', geom_col
        ValueError: if the dataframe or table name provided are wrong or the if_exists param is not valid.

    """
    def estimate_csv_size(gdf):
        n = min(100, len(gdf))
Member

I would update 100 once we have information about which value gives enough precision for the estimation.

Contributor Author

Yes, but spoiler: that looks like a good number.

@@ -130,7 +138,14 @@ def to_carto(dataframe, table_name, credentials=None, if_exists='fail', geom_col
    elif isinstance(dataframe, GeoDataFrame):
        log.warning('Geometry column not found in the GeoDataFrame.')

    table_name = context_manager.copy_from(gdf, table_name, if_exists, cartodbfy)
    chunk_count = int(math.ceil(estimate_csv_size(gdf) / max_upload_size))
Member

I think we can remove the int( here.
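
In Python 3, math.ceil already returns an int, so the call can simply be:

chunk_count = math.ceil(estimate_csv_size(gdf) / max_upload_size)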

Contributor Author

Totally!

@@ -21,6 +21,27 @@
DEFAULT_RETRY_TIMES = 3


def retry_copy(func):
Member

Nice decorator!
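
For readers without the full diff, a retry decorator along these lines is a common pattern; this is only an illustrative sketch, not the actual retry_copy implementation (the exception type caught and the backoff strategy are assumptions):

import functools
import time

DEFAULT_RETRY_TIMES = 3


def retry_copy(func):
    """Retry a copy operation up to DEFAULT_RETRY_TIMES before giving up (illustrative sketch)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for attempt in range(1, DEFAULT_RETRY_TIMES + 1):
            try:
                return func(*args, **kwargs)
            except Exception:  # the real decorator likely catches a narrower copy error
                if attempt == DEFAULT_RETRY_TIMES:
                    raise
                time.sleep(attempt)  # simple linear backoff, purely illustrative
    return wrapper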

Contributor Author

🙇

norm_table_name = to_carto(gdf, table_name, CREDENTIALS, max_upload_size=100000)

# Then
assert cm_mock.call_count == 12
Member

Does this mean that there are 12 chunks? Could you add a note explaining where this comes from (max_upload_size and the estimated data size)?
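
For illustration only (the actual test fixture size is not shown here), the chunk count follows from the size estimate and max_upload_size; for example, an estimate of roughly 1.15 MB would give:

import math

math.ceil(1150000 / 100000)  # == 12 chunks, so the mocked copy call runs 12 times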

Contributor Author

Added note 👍

@antoniocarlon (Contributor Author)

@Jesus89 second CR round?

@Jesus89 (Member) left a comment

LGTM!

@Jesus89 merged commit 40ca547 into develop Aug 24, 2020
@Jesus89 deleted the chore/ch93723/optimize-upload-to-carto branch August 24, 2020 11:16