chore: Create ml_datasets_uscentral1 for penguins table #204

Merged
adlersantos merged 8 commits into main from ml-datasets-region
Oct 8, 2021


@adlersantos adlersantos commented Oct 8, 2021

Description

Note: This PR is based on a feature branch (#203) that now supports adding alternative BQ datasets.

BQ datasets are location-specific, but we (internally) need an ml_datasets_uscentral1 for upcoming ML guides and tutorials. This PR adds that dataset, which loads the same penguins table under it.

Checklist

  • Please merge this PR for me once it is approved.
  • If this PR adds or edits a dataset or pipeline, it was reviewed and approved by the Google Cloud Public Datasets team beforehand.
  • If this PR adds or edits a dataset or pipeline, I put all my code inside datasets/<YOUR-DATASET> and nothing outside of that directory.
  • This PR is appropriately labeled.

@adlersantos adlersantos added the cleanup (Cleanup or refactor code) and data onboarding (Onboard a dataset or submit a pipeline) labels Oct 8, 2021
@google-cla google-cla bot added the cla: yes label Oct 8, 2021
@adlersantos
Member Author

cc @ivanmkc

@leahecole
Contributor

LGTM but @tswast should probably be final reviewer since he has additional context

Base automatically changed from bq-dataset-namespacing to main October 8, 2021 20:09
Contributor

@tswast tswast left a comment

LGTM

Technically, the GCS-to-GCS step is not needed because the samples data bucket already has a copy in gs://cloud-samples-data-us-central1. But since we want this to also be useful as an example for future data sources where that is not the case, I think it makes sense to keep it.

@adlersantos
Member Author

@tswast Thanks a lot!

I had to add the intermediate GCS-to-GCS step because I started getting this error on the DAG when loading to ml_datasets_uscentral1:

google.api_core.exceptions.BadRequest: 400 
Cannot read and write in different locations: source: US, destination: us-central1

But yes, this can serve as a reference pattern for others who encounter the same issue.
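
The pattern discussed in this thread (stage the file in a bucket that matches the dataset's location before loading) can be sketched as a small planning function. This is a minimal illustration under stated assumptions, not the pipeline's actual DAG code: `plan_load_steps`, the `Step` type, and the `example-*` bucket names are hypothetical, and the location check is a simplified case-insensitive comparison, whereas BigQuery's real colocation rules may be more nuanced.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    """One planned pipeline step (hypothetical type, for illustration only)."""
    action: str        # "gcs_to_gcs" or "gcs_to_bq"
    source: str
    destination: str


def _object_path(gcs_uri: str) -> str:
    """Return the object path portion of a gs://bucket/path URI."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError(f"Not a GCS URI: {gcs_uri!r}")
    bucket, _, path = gcs_uri[len("gs://"):].partition("/")
    if not bucket or not path:
        raise ValueError(f"URI must include a bucket and an object: {gcs_uri!r}")
    return path


def plan_load_steps(
    source_uri: str,
    source_location: str,
    staging_bucket: str,
    dataset_location: str,
    table: str,
) -> List[Step]:
    """Plan the steps needed to load a GCS object into a BigQuery table.

    Assumption: a load is only attempted directly when the source and
    dataset locations match (compared case-insensitively). Otherwise the
    object is first copied to a staging bucket that is assumed to live in
    the dataset's location.
    """
    steps: List[Step] = []
    load_uri = source_uri

    if source_location.lower() != dataset_location.lower():
        # Locations differ (e.g. "US" vs "us-central1"): stage a copy first.
        load_uri = f"gs://{staging_bucket}/{_object_path(source_uri)}"
        steps.append(Step("gcs_to_gcs", source_uri, load_uri))

    steps.append(Step("gcs_to_bq", load_uri, table))
    return steps


if __name__ == "__main__":
    # Hypothetical bucket names; the table name mirrors the one in this PR.
    for step in plan_load_steps(
        source_uri="gs://example-source-bucket/data/penguins.csv",
        source_location="US",
        staging_bucket="example-staging-us-central1",
        dataset_location="us-central1",
        table="ml_datasets_uscentral1.penguins",
    ):
        print(step)
```

In a real pipeline, each planned step would map to the corresponding transfer or load operation; the sketch only captures the decision of when the intermediate copy is added.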

@adlersantos adlersantos merged commit 072aa87 into main Oct 8, 2021
@adlersantos adlersantos deleted the ml-datasets-region branch October 8, 2021 20:34