Upload table using to_carto in chunks #1676

Merged
13 commits merged from chore/ch93723/optimize-upload-to-carto into develop on Aug 24, 2020

Conversation

@antoniocarlon (Contributor) commented Aug 21, 2020

  • Upload the table in chunks when using to_carto. The file size is estimated by extracting a sample of 100 rows from the dataframe and computing the size of the exported file using the same mechanism as the actual CSV export (WKB encoding and so on); a sketch of the overall flow follows below.
  • Added a retry decorator for the _copy_to and _copy_from functions.
  • Added tests for the new functionality.
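
A minimal sketch of how these pieces could fit together (not the actual cartoframes implementation: upload_in_chunks, the np.array_split splitting, and the append behavior for later chunks are assumptions made for illustration; estimate_csv_size, max_upload_size and context_manager.copy_from appear in the diffs below):

import math

import numpy as np


def upload_in_chunks(gdf, table_name, context_manager, if_exists, cartodbfy, max_upload_size):
    # Estimate the exported CSV size from a small sample and derive the chunk count.
    chunk_count = max(1, math.ceil(estimate_csv_size(gdf) / max_upload_size))

    # Split the dataframe into roughly equal row chunks and upload them one by one.
    for i, chunk in enumerate(np.array_split(gdf, chunk_count)):
        # The first chunk honors if_exists; later chunks are appended to the same table.
        mode = if_exists if i == 0 else 'append'
        table_name = context_manager.copy_from(chunk, table_name, mode, cartodbfy)

    return table_name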

@Jesus89 (Member) left a comment

Looks great!

I just added some comments about missing docs and perf improvements.

@@ -93,6 +96,11 @@ def to_carto(dataframe, table_name, credentials=None, if_exists='fail', geom_col
        ValueError: if the dataframe or table name provided are wrong or the if_exists param is not valid.

    """
    def estimate_csv_size(gdf):
Member

I would move estimate_csv_size to utils.utils, and also compute_copy_data, since we are using it in different places.

Contributor Author

I haven't moved it to utils.utils because that would mean that utils.utils would depend on io.managers.context_manager, and since that file already depends on utils.utils, that would create a circular dependency.

Member

Oh, I'll move it to utils.columns then, or at least out of the to_carto function.

Contributor Author

The utils.columns file has the same problem with circular dependencies.

I have tried extracting the method and all of its dependencies to utils.utils or utils.columns, but in that case I find circular dependencies between utils.utils and utils.columns.

Finally, I have extracted the estimate_csv_size method to the carto.py file, but outside the to_carto method.

cartoframes/io/carto.py
def estimate_csv_size(gdf):
    n = min(100, len(gdf))
    return len(''.join([x.decode("utf-8") for x in
                        _compute_copy_data(gdf.sample(n=n), get_dataframe_columns_info(gdf))])) * len(gdf) / n
Member

I would extract get_dataframe_columns_info(gdf) into a variable columns_info before the loop.

Also, we could try something like sum([len(x) for x in ...]) to avoid the decode and join, so it will be a bit faster.
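
A sketch of the suggested change (the final code in the PR may differ slightly):

def estimate_csv_size(gdf):
    n = min(100, len(gdf))
    columns_info = get_dataframe_columns_info(gdf)  # computed once instead of inside the expression
    # Summing raw byte lengths avoids decoding and joining the chunks just to measure them.
    return sum(len(x) for x in _compute_copy_data(gdf.sample(n=n), columns_info)) * len(gdf) / n

Measuring byte length directly is also closer to the actual upload size than counting decoded characters.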

Contributor Author

nice catch, on it!

@@ -93,6 +96,11 @@ def to_carto(dataframe, table_name, credentials=None, if_exists='fail', geom_col
        ValueError: if the dataframe or table name provided are wrong or the if_exists param is not valid.

    """
    def estimate_csv_size(gdf):
        n = min(100, len(gdf))
Member

I would update 100 once we have information about which value gives enough precision for the estimation.

Contributor Author

Yes, but spoiler: that looks like a good number.

@@ -130,7 +138,14 @@ def to_carto(dataframe, table_name, credentials=None, if_exists='fail', geom_col
    elif isinstance(dataframe, GeoDataFrame):
        log.warning('Geometry column not found in the GeoDataFrame.')

    table_name = context_manager.copy_from(gdf, table_name, if_exists, cartodbfy)
    chunk_count = int(math.ceil(estimate_csv_size(gdf) / max_upload_size))
Member

I think we can remove the int( here.
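
In Python 3, math.ceil already returns an int, so the call can simply be:

chunk_count = math.ceil(estimate_csv_size(gdf) / max_upload_size)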

Contributor Author

Totally!

@@ -21,6 +21,27 @@
DEFAULT_RETRY_TIMES = 3


def retry_copy(func):
Member

Nice decorator!
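
For readers without the full diff, a retry decorator along these lines is a common pattern; this is only an illustrative sketch, not the actual retry_copy implementation (the exception type caught and the backoff strategy are assumptions):

import functools
import time

DEFAULT_RETRY_TIMES = 3


def retry_copy(func):
    """Retry a copy operation up to DEFAULT_RETRY_TIMES before giving up (illustrative sketch)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for attempt in range(1, DEFAULT_RETRY_TIMES + 1):
            try:
                return func(*args, **kwargs)
            except Exception:  # the real decorator likely catches a narrower copy error
                if attempt == DEFAULT_RETRY_TIMES:
                    raise
                time.sleep(attempt)  # simple linear backoff, purely illustrative
    return wrapper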

Contributor Author

🙇

norm_table_name = to_carto(gdf, table_name, CREDENTIALS, max_upload_size=100000)

# Then
assert cm_mock.call_count == 12
Member

Does this mean that there are 12 chunks? Could you add a note explaining where this comes from (max_upload_size and the estimated data size)?
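
For illustration only (the actual test fixture size is not shown here), the chunk count follows from the size estimate and max_upload_size; for example, an estimate of roughly 1.15 MB would give:

import math

math.ceil(1150000 / 100000)  # == 12 chunks, so the mocked copy call runs 12 times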

Contributor Author

Added note 👍

@antoniocarlon (Contributor Author)

@Jesus89 second CR round?

@Jesus89 (Member) left a comment

LGTM!

@Jesus89 merged commit 40ca547 into develop Aug 24, 2020
@Jesus89 deleted the chore/ch93723/optimize-upload-to-carto branch August 24, 2020 11:16