Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First upload of points enrichment #989

Closed
wants to merge 4 commits into from

Conversation

alejandrohall
Copy link
Contributor

No description provided.

cartoframes/data/enrichment/points_enrichment.py Outdated Show resolved Hide resolved
cartoframes/data/enrichment/points_enrichment.py Outdated Show resolved Hide resolved
@@ -0,0 +1 @@
from .points_enrichment import enrich_points

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'.points_enrichment.enrich_points' imported but unused

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make the dog happy :)

Copy link
Contributor

@simon-contreras-deel simon-contreras-deel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments

@@ -0,0 +1 @@
from .points_enrichment import enrich_points
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make the dog happy :)

cartoframes/data/enrichment/bigquery_client.py Outdated Show resolved Hide resolved
requirements.txt Outdated Show resolved Hide resolved
cartoframes/data/enrichment/points_enrichment.py Outdated Show resolved Hide resolved
Copy link
Contributor

@alrocar alrocar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First implementation looks good 👍

Let's agree which comments we want to:

  1. Be tackled on this PR.
  2. Create a different issue for next iteration of the enrichment.
  3. Discard.

OTOH I'm missing some tests.

cartoframes/data/enrichment/bigquery_client.py Outdated Show resolved Hide resolved
cartoframes/data/enrichment/points_enrichment.py Outdated Show resolved Hide resolved
bq_client = bigquery_client.BigQueryClient(credentials)

# Copy dataframe and generate id to join to original data later
data_copy = data.copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a full copy? Could we just use the original dataframe? I'm thinking on performance issues...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been thinking about this. We can operate against the original dataframe provided by the user and then revert these operations in order to let the data provided by the user in the same state but what happens if the function is stopped in the middle of the operations, the revert operations can not be done so the data provided by will be modified.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmm, could we do kind of a view of the original dataframe to avoid the copy? I mean, if we don't do the copy, but just select both columns we need data_copy[[data_geom_column, _ENRICHMENT_ID]] + fill the enrichment ID, to a new dataframe,

I'd say this new dataframe is also a reference to the original one, but you didn't do a full copy (not sure about this though)

Not a big deal by now, but just telling to consider.


bq_client.upload_dataframe(data_geometry_id_copy, _ENRICHMENT_ID, data_geom_column, data_tablename)

variables_id = variables['id'].tolist()
Copy link
Contributor

@alrocar alrocar Sep 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per initial API docs (and also feedback gathered), we want the enrich methods to receive either:

  • a list of variable IDs or a list of Variable objects
  • a list of dataset IDs or a list of Dataset objects

We are rethinking the glue between the catalog and this API, so let's keep it as it is by now.

Related issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure!

return data_augmentated


def __process_enrichment_variables(variables):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a test for this method

return table_to_variables


def __get_name_geotable_from_datatable(datatable):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a test for this method.

cartoframes/data/enrichment/points_enrichment.py Outdated Show resolved Hide resolved
FROM `carto-do-customers.{user_workspace}.{enrichment_table}` enrichment_table
JOIN `carto-do-customers.{user_workspace}.{enrichment_geo_table}` enrichment_geo_table
ON enrichment_table.geoid = enrichment_geo_table.geoid
JOIN `{working_project}.{working_dataset}.{data_table}` data_table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmm, are we uploading the user data to a different project/dataset than the user dataset? In authorization time we are granting write permissions over the user dataset specifically to allow uploading data, but not sure what's the best option.

Writing on this shared dataset would mean any user would have permissions to write other users tables?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should talk about this, I think the best way is to create a new project and a dataset for every user who tries to upload something, so we can give permissions to this dataset too. But this dataset needs to be created by the people who are creating the authorize endpoint because if you want to give permissions to a dataset this needs to be created already.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best way is to create a new project and a dataset for every user who tries to upload something

Yep we are already doing that. The service account (or token) has write permissions on carto-do-customers.{username} so I think it's safe to upload temp tables there.

See related issue

SELECT data_table.{enrichment_id},
{variables},
ST_Area(enrichment_geo_table.geom) AS area,
NULL AS population
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this needed?

Copy link
Contributor

@alrocar alrocar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments.

Still missing some tests. Since we are going to keep working on this we can postpone them for future PRs, but adding here a short list to take into account:

  • Unit tests of the private methods in points_enrichment.py
  • Some e2e test of the enrichment function
  • Some other tests mocking errors and exceptions.

data_geometry_id_copy = data_copy[[_ENRICHMENT_ID, data_geom_column]]
data_copy = __copy_data_and_generate_enrichment_id(data, _ENRICHMENT_ID)

# Select only geometry and id and build schema
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try to avoid comments unless they say anything else than the code they refer to. If the code is clean and readable enough (and for me this is the case) they are redundant :). So we can get rid of comments in this method.

FROM `carto-do-customers.{user_workspace}.{enrichment_table}` enrichment_table
JOIN `carto-do-customers.{user_workspace}.{enrichment_geo_table}` enrichment_geo_table
ON enrichment_table.geoid = enrichment_geo_table.geoid
JOIN `{working_project}.{working_dataset}.{data_table}` data_table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the best way is to create a new project and a dataset for every user who tries to upload something

Yep we are already doing that. The service account (or token) has write permissions on carto-do-customers.{username} so I think it's safe to upload temp tables there.

See related issue


# Data table is a universal unique identifier
data_tablename = uuid.uuid4().hex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a prefix or something to the table name to easily identify them, in case we have to do a bulk deletion of those tables.

@alrocar
Copy link
Contributor

alrocar commented Sep 24, 2019

Closing this one in favour of #1016

@alrocar alrocar closed this Sep 24, 2019
@Jesus89 Jesus89 deleted the feature/first_stage_enrichment_points branch May 25, 2020 05:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants