chore: Create `ml_datasets_uscentral1` for `penguins` table #204

adlersantos · 2021-10-08T19:14:19Z

Description

Note: This PR is based on a feature branch (#203) that adds support for alternative BQ datasets.

BQ datasets are location-specific, and we (internally) need an ml_datasets_uscentral1 dataset for upcoming ML guides and tutorials. This PR adds that dataset and loads the same penguins table into it.

Checklist

Please merge this PR for me once it is approved.
If this PR adds or edits a dataset or pipeline, it was reviewed and approved by the Google Cloud Public Datasets team beforehand.
If this PR adds or edits a dataset or pipeline, I put all my code inside datasets/<YOUR-DATASET> and nothing outside of that directory.
This PR is appropriately labeled.

adlersantos · 2021-10-08T19:14:33Z

cc @ivanmkc

leahecole · 2021-10-08T20:04:02Z

LGTM, but @tswast should probably be the final reviewer since he has additional context.

tswast

LGTM

Technically, the GCS-to-GCS step is not needed because the samples data bucket already has a copy in gs://cloud-samples-data-us-central1. But since we want this to also be useful as an example for future data sources where that is not the case, I think it makes sense to keep it.

adlersantos · 2021-10-08T20:34:15Z

@tswast Thanks a lot!

I had to add the intermediate GCS-to-GCS step because I started getting this error on the DAG when loading to ml_datasets_uscentral1:

google.api_core.exceptions.BadRequest: 400 
Cannot read and write in different locations: source: US, destination: us-central1

But yes, this can serve as a reference pattern for others who run into the same issue.
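For readers who hit the same error, the decision behind the intermediate step can be sketched roughly as follows. This is an illustration only, not the DAG from this PR: `plan_load`, `example-source-bucket`, `example-staging-bucket`, and `penguins.csv` are hypothetical names, and the "locations must match exactly" check is a simplifying assumption rather than BigQuery's full location rules.

```python
from typing import List


def same_location(a: str, b: str) -> bool:
    """Compare two location names, ignoring case and surrounding whitespace."""
    return a.strip().lower() == b.strip().lower()


def plan_load(
    source_uri: str,
    source_location: str,
    staging_bucket: str,
    destination_location: str,
    table: str,
) -> List[str]:
    """Return the ordered steps needed to load a GCS object into a BQ table.

    Assumption (simplified): a load works directly only when the source
    bucket and the destination dataset are in the same location. Otherwise
    the object is first copied to a staging bucket that lives in the
    dataset's location, and the load reads from that copy.
    """
    if same_location(source_location, destination_location):
        return [f"load {source_uri} -> {table}"]

    # "gs://bucket/path/to/object" -> "path/to/object"
    object_name = source_uri.split("/", 3)[3]
    staged_uri = f"gs://{staging_bucket}/{object_name}"
    return [
        f"copy {source_uri} -> {staged_uri}",  # the GCS-to-GCS step
        f"load {staged_uri} -> {table}",
    ]


if __name__ == "__main__":
    # Hypothetical bucket and object names, for illustration only.
    steps = plan_load(
        source_uri="gs://example-source-bucket/penguins.csv",
        source_location="US",
        staging_bucket="example-staging-bucket",
        destination_location="us-central1",
        table="ml_datasets_uscentral1.penguins",
    )
    for step in steps:
        print(step)
```

In a real pipeline each planned step would map to a task (a bucket-to-bucket copy followed by a load job); the sketch only captures the ordering and the reason the copy exists.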

adlersantos added 8 commits October 8, 2021 11:45

feat: Prepend dataset ID in TF resource names for BQ tables

89f41f5

feat: Support explicitly specifying the dataset_id for BQ tables

7afe6d7

feat: refactored generate TF tests to setup and test multiple BQ dataset IDs

40dd6d1

feat: Creates the BQ dataset ml_dataset_uscentral1

7ddd55c

feat: create a penguins BQ table in ml_datasets_uscentral1 BQ dataset

3771c56

feat: generate TF files

f5e191a

feat: revised penguins DAG to load data to two locations

b79ef60

feat: generate DAG

da17ed1

adlersantos added the cleanup and data onboarding labels Oct 8, 2021

adlersantos requested review from tswast and leahecole October 8, 2021 19:14

google-cla bot added the cla: yes label Oct 8, 2021

Base automatically changed from bq-dataset-namespacing to main October 8, 2021 20:09

tswast approved these changes Oct 8, 2021

adlersantos merged commit 072aa87 into main Oct 8, 2021

adlersantos deleted the ml-datasets-region branch October 8, 2021 20:34
