
Commit

Merge branch 'main' into new_york_taxi_trips
nlarge-google committed Aug 16, 2022
2 parents a37c32d + 13a829f commit 018de83
Showing 222 changed files with 21,385 additions and 30,135 deletions.
24 changes: 17 additions & 7 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -4,17 +4,27 @@ Note: If you are adding or editing a dataset, please specify the dataset folder

## Checklist

Note: If an item applies to you, all of its sub-items must be fulfilled

- [ ] **(Required)** This pull request is appropriately labeled
- [ ] Please merge this pull request after it's approved

Use the sections below based on what's applicable to your PR and delete the rest:

### Feature
- [ ] I'm adding or editing a feature
- [ ] I have updated the [`README`](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/blob/main/README.md) accordingly
- [ ] I have added tests for the feature
- [ ] I have updated the [`README`](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/blob/main/README.md) accordingly
- [ ] I have added/revised tests for the feature

### Data Onboarding
- [ ] I'm adding or editing a dataset
- [ ] The [Google Cloud Datasets team](mailto:cloud-datasets-onboarding@google.com) is aware of the proposed dataset
- [ ] I put all my code inside `datasets/<DATASET_NAME>` and nothing outside of that directory
- [ ] The [Google Cloud Datasets team](mailto:cloud-datasets-onboarding@google.com) is aware of the proposed dataset
- [ ] I put all my code inside `datasets/<DATASET_NAME>` and nothing outside of that directory

### Documentation
- [ ] I'm adding/editing documentation

### Bug fix
- [ ] I'm submitting a bugfix
- [ ] I have added tests to my bugfix (see the [`tests`](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/tree/main/tests) folder)
- [ ] I have added/revised tests related to my bugfix (see the [`tests`](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/tree/main/tests) folder)

### Code cleanup or refactoring
- [ ] I'm refactoring or cleaning up some code
2 changes: 1 addition & 1 deletion .github/workflows/unit-tests-airflow1.yaml
@@ -19,7 +19,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: [3.8]
python-version: [3.8.12]
steps:
- uses: actions/checkout@v2
- uses: hashicorp/setup-terraform@v2
2 changes: 1 addition & 1 deletion .github/workflows/unit-tests.yaml
@@ -19,7 +19,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: [3.8]
python-version: [3.8.12]
steps:
- uses: actions/checkout@v2
- uses: hashicorp/setup-terraform@v2
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -22,7 +22,7 @@ repos:
hooks:
- id: check-yaml
- repo: https://github.com/psf/black
rev: 20.8b1
rev: '22.3.0'
hooks:
- id: black
name: black
79 changes: 79 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,84 @@
# Changelog

## [5.1.0](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/compare/v5.0.0...v5.1.0) (2022-07-30)


### Features

* Add scaffold script for directory + dataset.yaml setup ([#412](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/412)) ([5bf354b](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/5bf354b1cce3a8cec8c974e13932ddd91a554479))
* Adding a notebook tutorial for the EPA dataset: CO levels ([#422](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/422)) ([f0bab59](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/f0bab59aa11e8314e584413f331d50af66d789f8))
* Adds operators for Cloud SQL, Cloud Functions, and GCE ([#429](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/429)) ([9b5da34](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/9b5da3455e0f9a60d53b3783ff968354a9373025))
* Support `--async-builds` flag for `generate_dag.py` ([#424](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/424)) ([7536df9](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/7536df9aeb3de265a74aceab6f3f843e613714a5))


### Datasets

* Onboard DeepMind AlphaFold DB ([#431](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/431)) ([02c887e](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/02c887ebdf96ba9826d577c307eabd5ac5194aec))
* Onboard CelebA dataset ([#420](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/420)) ([0c28563](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/0c285639b9f8c7fae10581286bb0df86a47eee8c))
* Adds BQ views to `scalable_open_source` dataset ([#416](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/416)) ([2785234](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/2785234639639ba2fbe0bdd06f2f53d258ae6f46))
* Rename co2 columns to emissions to make it generic from Travel Impact Model dataset. ([#418](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/418)) ([e1ac106](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/e1ac1061192e532f2b16d4b59cbfbcb634bde92b))


### Bug Fixes

* Change `cms_medicare` tables with column `provider_zipcode` from integer to string type ([#417](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/417)) ([27b0a9b](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/27b0a9b06a236c88322f2144467fee2986702efd))
* Resolve conflicts on Census Bureau ACS ([#414](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/414)) ([492b973](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/492b973e3215a3c960e06ef061eb0c10de8f8419))
* Resolve CRON value in Cloud Storage Geo Index dataset ([#413](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/413)) ([8903e82](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/8903e82016d3bca5ef1dd83a9915b60c1e620a66))
* Resolve IP error when creating NOAA cluster ([#423](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/423)) ([82d53f4](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/82d53f47f013e76be0054f92f3db32b69ed769f3))
* Use proper GCS prefix for custom data folder ([#408](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/408)) ([9d56363](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/9d563635004542377f1880fae8de2c901a64555a))

## [5.0.0](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/compare/v4.2.0...v5.0.0) (2022-07-11)


### ⚠ BREAKING CHANGES

* Upgrade to Airflow 2.2.5 and Python 3.8.12 (#394)

### Datasets

* Onboard Carbon-Free Energy Calculator dataset ([#391](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/391)) ([f3a9447](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/f3a944703c7d53b2d145ddf370fd861825331726))
* Onboard Census Bureau ACS Dataset ([#399](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/399)) ([98e0179](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/98e01799bd47493f3fee18b8e0075b61ff45b007))
* Onboard Fashion MNIST dataset ([#387](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/387)) ([91b7f6a](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/91b7f6ac71a2f5fb569ce1b0c423d683f6c3c447))
* Onboard IMDb dataset ([#406](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/406)) ([2559838](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/25598380b0353c9bba9b6d08c0164691815d2bc1))
* Optimize tests for DAG and Terraform generation ([#395](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/395)) ([ffcd18c](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/ffcd18cb2e26f30622f3d1f71832d6c164b2b819))
* Remove co2e columns from Travel Impact Model dataset. ([#400](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/400)) ([d7179ce](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/d7179ce9779de978c77c77c96d63fe64e2891e20))


### Bug Fixes

* NOAA - Resolve table field name issue. ([#402](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/402)) ([51860eb](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/51860eb593da8b58b8f9ca69bd30df0e3a506c08))
* Use specific Python version for Airflow 1 tests ([#401](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/401)) ([6fa94a7](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/6fa94a73966440ba26749c539b28a8219b910b60))

## [4.2.0](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/compare/v4.1.1...v4.2.0) (2022-06-25)


### Datasets

* Onboard COVID-19 dataset from The New York Times ([#383](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/383)) ([9aac451](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/9aac4519bd732fa903113846b6438ffe2ab77e5c))
* Onboard NOAA dataset ([#378](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/378)) ([02cc038](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/02cc038bc108ee96f33ff4b476de282cb9341fb9))
* Onboard San Jose Translation dataset ([#377](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/377)) ([63ea9b9](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/63ea9b9e62bf74f1ebb9e70f52cb8a5298e85753))
* Onboarding MIMIC-III dataset ([#389](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/389)) ([baf6b8d](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/baf6b8d2535f11743de30b462b84811070d50857))
* [datasets/gbif] Add a query to uncover species found in one region only ([#388](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/388)) ([bd5a135](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/bd5a135ef3d67374ea15ebbb6a9f29472fedf79a))


### Features

* Manage local and remote Airflow variables during deployment ([#392](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/392)) ([f26db3a](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/f26db3a30806b17e8386e72b440a7909022f798a))

## [4.1.1](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/compare/v4.1.0...v4.1.1) (2022-06-16)


### Datasets

* Onboard IMDB dataset ([#382](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/382)) ([8bf7065](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/8bf7065699ef9519778e640c5168fc876e1f8081))
* Onboard MNIST dataset ([#379](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/379)) ([9809935](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/9809935936866332e425f278ec97b935bdf4a65d))
* Onboard New York Taxi Trips dataset ([#381](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/381)) ([897ac3f](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/897ac3f3352e5dc40e08c5c80e223f877a406394))


### Bug Fixes

* Fixed variable reference to container images for New York dataset ([#380](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/issues/380)) ([e4a6718](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/commit/e4a671850d0eb2512f904d508d739c1aefa16e8c))

## [4.1.0](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/compare/v4.0.0...v4.1.0) (2022-06-10)


7 changes: 5 additions & 2 deletions README.md
@@ -1,7 +1,10 @@
# Google Cloud Datasets: Data Pipelines and Documentation Set

This repository contains the followings:
![public-datasets-pipelines](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/blob/main/images/architecture.png)

This repository contains the following:

- Cloud-native, data pipeline architecture for onboarding public datasets to [Google Cloud Datasets](https://cloud.google.com/datasets).
- Documentation set for tutorials, samples, and other articles related to the datasets hosted by the program.
- Documentation set containing tutorials, samples, and other articles making use of the datasets hosted by the program.

For detailed documentation, please see the [Wiki Pages](https://github.com/GoogleCloudPlatform/public-datasets-pipelines/wiki).
28 changes: 28 additions & 0 deletions datasets/celeba/infra/provider.tf
@@ -0,0 +1,28 @@
/**
* Copyright 2021 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/


provider "google" {
project = var.project_id
impersonate_service_account = var.impersonating_acct
region = var.region
}

data "google_client_openid_userinfo" "me" {}

output "impersonating-account" {
value = data.google_client_openid_userinfo.me.email
}
26 changes: 26 additions & 0 deletions datasets/celeba/infra/variables.tf
@@ -0,0 +1,26 @@
/**
* Copyright 2021 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/


variable "project_id" {}
variable "bucket_name_prefix" {}
variable "impersonating_acct" {}
variable "region" {}
variable "env" {}
variable "iam_policies" {
default = {}
}

44 changes: 44 additions & 0 deletions datasets/celeba/pipelines/celeba/celeba_dag.py
@@ -0,0 +1,44 @@
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


from airflow import DAG
from airflow.providers.google.cloud.transfers import gcs_to_gcs

default_args = {
"owner": "Google",
"depends_on_past": False,
"start_date": "2022-07-22",
}


with DAG(
dag_id="celeba.celeba",
default_args=default_args,
max_active_runs=1,
schedule_interval="@once",
catchup=False,
default_view="graph",
) as dag:

# Transfer data from source to destination in GCS
GCStoGCS_transfer = gcs_to_gcs.GCSToGCSOperator(
task_id="GCStoGCS_transfer",
source_bucket="{{ var.value.composer_bucket }}",
source_object="{{ var.json.celeba.source_object }}",
destination_bucket="{{ var.value.composer_bucket }}",
destination_object="{{ var.json.celeba.destination_object }}",
)

GCStoGCS_transfer
42 changes: 42 additions & 0 deletions datasets/celeba/pipelines/celeba/pipeline.yaml
@@ -0,0 +1,42 @@
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

---
resources: ~

dag:
airflow_version: 2
initialize:
dag_id: celeba
default_args:
owner: "Google"
depends_on_past: False
start_date: "2022-07-22"
max_active_runs: 1
schedule_interval: "@once"
catchup: False
default_view: graph

tasks:
- operator: "GoogleCloudStorageToGoogleCloudStorageOperator"
description: "Transfer data from source to destination in GCS"
args:
task_id: "GCStoGCS_transfer"
source_bucket: "{{ var.value.composer_bucket }}"
source_object: "{{ var.json.celeba.source_object }}"
destination_bucket: "{{ var.value.composer_bucket }}"
destination_object: "{{ var.json.celeba.destination_object }}"

graph_paths:
- "GCStoGCS_transfer"
23 changes: 23 additions & 0 deletions datasets/celeba/pipelines/dataset.yaml
@@ -0,0 +1,23 @@
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

dataset:
name: celeba
friendly_name: Celebrity Attributes
description: Dataset of images of the facial attributes of various celebrities.
dataset_sources: ~
terms_of_use: ~


resources: ~
11 changes: 3 additions & 8 deletions datasets/census_bureau_acs/infra/cbsa_2019_1yr_pipeline.tf
@@ -16,15 +16,10 @@


resource "google_bigquery_table" "census_bureau_acs_cbsa_2019_1yr" {
project = var.project_id
dataset_id = "census_bureau_acs"
table_id = "cbsa_2019_1yr"

project = var.project_id
dataset_id = "census_bureau_acs"
table_id = "cbsa_2019_1yr"
description = "CBSA 2019 1 year report table"




depends_on = [
google_bigquery_dataset.census_bureau_acs
]
11 changes: 3 additions & 8 deletions datasets/census_bureau_acs/infra/cbsa_2019_5yr_pipeline.tf
@@ -16,15 +16,10 @@


resource "google_bigquery_table" "census_bureau_acs_cbsa_2019_5yr" {
project = var.project_id
dataset_id = "census_bureau_acs"
table_id = "cbsa_2019_5yr"

project = var.project_id
dataset_id = "census_bureau_acs"
table_id = "cbsa_2019_5yr"
description = "CBSA 2019 5 years report table"




depends_on = [
google_bigquery_dataset.census_bureau_acs
]
16 changes: 0 additions & 16 deletions datasets/census_bureau_acs/infra/census_bureau_acs_dataset.tf
@@ -24,19 +24,3 @@ resource "google_bigquery_dataset" "census_bureau_acs" {
output "bigquery_dataset-census_bureau_acs-dataset_id" {
value = google_bigquery_dataset.census_bureau_acs.dataset_id
}

resource "google_storage_bucket" "census-bureau-acs" {
name = "${var.bucket_name_prefix}-census-bureau-acs"
force_destroy = true
location = "US"
uniform_bucket_level_access = true
lifecycle {
ignore_changes = [
logging,
]
}
}

output "storage_bucket-census-bureau-acs-name" {
value = google_storage_bucket.census-bureau-acs.name
}
