First upload of points enrichment #989

alejandrohall · 2019-09-16T07:34:26Z

No description provided.

cartoframes/data/enrichment/points_enrichment.py

houndci-bot · 2019-09-16T07:34:29Z

cartoframes/data/enrichment/__init__.py

@@ -0,0 +1 @@
+from .points_enrichment import enrich_points


'.points_enrichment.enrich_points' imported but unused

Make the dog happy :)

simon-contreras-deel

Some comments

simon-contreras-deel · 2019-09-16T10:21:02Z

cartoframes/data/enrichment/__init__.py

@@ -0,0 +1 @@
+from .points_enrichment import enrich_points


Make the dog happy :)

cartoframes/data/enrichment/bigquery_client.py

requirements.txt

cartoframes/data/enrichment/points_enrichment.py

alrocar

First implementation looks good 👍

Let's agree on which comments we want to:

Tackle in this PR.
Move to a separate issue for the next iteration of the enrichment.
Discard.

OTOH I'm missing some tests.

cartoframes/data/enrichment/bigquery_client.py

cartoframes/data/enrichment/points_enrichment.py

alrocar · 2019-09-16T13:54:16Z

cartoframes/data/enrichment/points_enrichment.py

+    bq_client = bigquery_client.BigQueryClient(credentials)
+
+    # Copy dataframe and generate id to join to original data later
+    data_copy = data.copy()


Do we need a full copy? Could we just use the original dataframe? I'm thinking about performance issues...

I have been thinking about this. We could operate on the original dataframe provided by the user and then revert those operations, leaving the user's data in its original state. But if the function is stopped in the middle of the operations, the revert cannot run, so the data provided by the user would be left modified.

mmm, could we use a kind of view of the original dataframe to avoid the copy? I mean, instead of doing the full copy, we just select the two columns we need, data_copy[[data_geom_column, _ENRICHMENT_ID]], into a new dataframe and fill in the enrichment ID.

I'd say this new dataframe would still reference the original one without being a full copy (not sure about this though).

Not a big deal for now, just something to consider.
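The column-selection idea in this thread can be sketched as follows. This is a minimal illustration, not the PR's code: the helper name `select_geometry_with_id`, the id column name, and the sample data are assumptions. In pandas, selecting a list of columns returns a new DataFrame object holding only those columns, and `assign` returns another new object, so the caller's dataframe is left untouched and the remaining columns are not duplicated.

```python
import pandas as pd

_ENRICHMENT_ID = 'enrichment_id'  # assumed name for the generated id column


def select_geometry_with_id(data, geom_column):
    """Return a new two-column frame; `data` itself is not modified."""
    subset = data[[geom_column]]
    # assign() returns a new DataFrame instead of writing into `subset`.
    return subset.assign(**{_ENRICHMENT_ID: list(range(len(subset)))})


data = pd.DataFrame({
    'name': ['a', 'b'],
    'geometry': ['POINT (0 0)', 'POINT (1 1)'],
})
upload_frame = select_geometry_with_id(data, 'geometry')
```

Joining the enriched result back to the original data would then have to rely on the shared index, which this sketch does not cover.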

alrocar · 2019-09-16T13:56:08Z

cartoframes/data/enrichment/points_enrichment.py

+
+    bq_client.upload_dataframe(data_geometry_id_copy, _ENRICHMENT_ID, data_geom_column, data_tablename)
+
+    variables_id = variables['id'].tolist()


As per initial API docs (and also feedback gathered), we want the enrich methods to receive either:

a list of variable IDs or a list of Variable objects

a list of dataset IDs or a list of Dataset objects

We are rethinking the glue between the catalog and this API, so let's keep it as it is for now.

Related issue
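For reference, accepting either IDs or objects could look roughly like the sketch below. The `Variable` class here is a hypothetical stand-in with only an `id` attribute, and `normalize_variable_ids` is an illustrative name; neither is taken from the catalog API under discussion.

```python
from dataclasses import dataclass


@dataclass
class Variable:
    """Hypothetical stand-in for a catalog Variable object."""
    id: str


def normalize_variable_ids(variables):
    """Accept variable IDs, Variable objects, or a mix; return a list of IDs."""
    return [v.id if isinstance(v, Variable) else str(v) for v in variables]


variable_ids = normalize_variable_ids(['pop_total', Variable('median_age')])
```

The same pattern would apply to dataset IDs and Dataset objects.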

alrocar · 2019-09-16T13:56:37Z

cartoframes/data/enrichment/points_enrichment.py

+    return data_augmentated
+
+
+def __process_enrichment_variables(variables):


Add a test for this method

alrocar · 2019-09-16T13:56:55Z

cartoframes/data/enrichment/points_enrichment.py

+    return table_to_variables
+
+
+def __get_name_geotable_from_datatable(datatable):


Add a test for this method.

cartoframes/data/enrichment/points_enrichment.py

alrocar · 2019-09-16T14:01:47Z

cartoframes/data/enrichment/points_enrichment.py

+        FROM `carto-do-customers.{user_workspace}.{enrichment_table}` enrichment_table
+        JOIN `carto-do-customers.{user_workspace}.{enrichment_geo_table}` enrichment_geo_table
+          ON enrichment_table.geoid = enrichment_geo_table.geoid
+        JOIN `{working_project}.{working_dataset}.{data_table}` data_table


mmm, are we uploading the user data to a different project/dataset than the user dataset? At authorization time we grant write permissions on the user dataset specifically to allow uploading data, but I'm not sure what the best option is.

Would writing to this shared dataset mean that any user has permission to write to other users' tables?

Yes, we should talk about this. I think the best way is to create a new project and a dataset for every user who tries to upload something, so we can give permissions to this dataset too. But this dataset needs to be created by the people building the authorize endpoint, because permissions can only be granted on a dataset that already exists.

> I think the best way is to create a new project and a dataset for every user who tries to upload something

Yep, we are already doing that. The service account (or token) has write permissions on carto-do-customers.{username}, so I think it's safe to upload temp tables there.

See related issue
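As a rough sketch of what targeting the per-user dataset would mean for the query template, the table reference could be built like this. The helper name `user_table_reference` and the sample values are assumptions; only the `carto-do-customers.{username}` layout comes from the thread.

```python
_CUSTOMERS_PROJECT = 'carto-do-customers'


def user_table_reference(username, tablename):
    """Build a backtick-quoted `project.dataset.table` reference."""
    return '`{project}.{dataset}.{table}`'.format(
        project=_CUSTOMERS_PROJECT, dataset=username, table=tablename)


reference = user_table_reference('alice', 'tmp_1')
```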

alrocar · 2019-09-16T14:23:21Z

cartoframes/data/enrichment/points_enrichment.py

+        SELECT data_table.{enrichment_id},
+              {variables},
+              ST_Area(enrichment_geo_table.geom) AS area,
+              NULL AS population


is this needed?

alrocar

Some more comments.

Still missing some tests. Since we are going to keep working on this, we can postpone them to future PRs, but here is a short list to keep in mind:

Unit tests of the private methods in points_enrichment.py
Some e2e tests of the enrichment function
Some other tests mocking errors and exceptions
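One thing to keep in mind for the unit tests of the private methods: the helpers are named with a double leading underscore, and Python mangles such names when they are referenced inside a class body, which includes `unittest.TestCase` classes. The sketch below reproduces the effect in a single file with a hypothetical helper `__double` (not a function from this PR) and shows a lookup by string name that avoids it.

```python
def __double(value):
    # Hypothetical module-level helper with a double-underscore name.
    return value * 2


class Caller:
    def direct(self):
        # Compiled as `_Caller__double(2)`, so the lookup fails at call time.
        return __double(2)

    def by_name(self):
        # String keys are not mangled, so the lookup succeeds.
        return globals()['__double'](2)


caller = Caller()
try:
    caller.direct()
    direct_error = None
except NameError as error:
    direct_error = error
by_name_result = caller.by_name()
```

From a separate test module, `getattr(module, '__helper_name')` works the same way. Renaming the helpers with a single leading underscore would avoid the issue altogether.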

alrocar · 2019-09-17T17:28:49Z

cartoframes/data/enrichment/points_enrichment.py

-    data_geometry_id_copy = data_copy[[_ENRICHMENT_ID, data_geom_column]]
+    data_copy = __copy_data_and_generate_enrichment_id(data, _ENRICHMENT_ID)
+
+    # Select only geometry and id and build schema


Let's try to avoid comments unless they say something beyond what the code itself says. If the code is clean and readable enough (and for me this is the case), they are redundant :). So we can get rid of the comments in this method.

alrocar · 2019-09-17T17:32:58Z

cartoframes/data/enrichment/points_enrichment.py

+        FROM `carto-do-customers.{user_workspace}.{enrichment_table}` enrichment_table
+        JOIN `carto-do-customers.{user_workspace}.{enrichment_geo_table}` enrichment_geo_table
+          ON enrichment_table.geoid = enrichment_geo_table.geoid
+        JOIN `{working_project}.{working_dataset}.{data_table}` data_table


> I think the best way is to create a new project and a dataset for every user who tries to upload something

Yep, we are already doing that. The service account (or token) has write permissions on carto-do-customers.{username}, so I think it's safe to upload temp tables there.

See related issue

alrocar · 2019-09-17T17:34:04Z

cartoframes/data/enrichment/points_enrichment.py


+    # Data table is a universal unique identifier
    data_tablename = uuid.uuid4().hex


Let's add a prefix or something similar to the table name so these tables are easy to identify, in case we have to do a bulk deletion.
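A minimal sketch of the prefixed name, assuming a prefix value of `temp_enrichment_` (hypothetical; any agreed value would do) and an illustrative helper name:

```python
import uuid

_TABLE_PREFIX = 'temp_enrichment_'  # hypothetical prefix, not from the PR


def build_data_tablename(prefix=_TABLE_PREFIX):
    """Return a unique table name that starts with a recognizable prefix."""
    return prefix + uuid.uuid4().hex


data_tablename = build_data_tablename()
```

With a fixed prefix, leftover tables can be found by matching on the start of the table name.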

alrocar · 2019-09-24T04:39:20Z

Closing this one in favour of #1016

First upload of points enrichment

e8841de

alejandrohall requested a review from simon-contreras-deel September 16, 2019 07:34

houndci-bot reviewed Sep 16, 2019


Remove trailing whitespaces

9aaa966

simon-contreras-deel reviewed Sep 16, 2019


Add point enrichment example notebook

df96d3f

alrocar suggested changes Sep 16, 2019


alrocar assigned alejandrohall Sep 16, 2019

alrocar reviewed Sep 16, 2019


Apply comments by simon and alrocar

5557083

alrocar reviewed Sep 17, 2019


alrocar closed this Sep 24, 2019

Jesus89 deleted the feature/first_stage_enrichment_points branch May 25, 2020 05:54




