cleanup: Use Composer 2.0.0+ and remove node affinities from Kubernetes pods (#251)
adlersantos committed Dec 16, 2021
1 parent aa41dfe commit b95d62c
Showing 279 changed files with 853 additions and 3,649 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -8,11 +8,11 @@ Cloud-native, data pipeline architecture for onboarding public datasets to [Data
 
 # Requirements
 - Python `>=3.6.10,<3.9`. We currently use `3.8`. For more info, see the [Cloud Composer version list](https://cloud.google.com/composer/docs/concepts/versioning/composer-versions).
-- Familiarity with [Apache Airflow](https://airflow.apache.org/docs/apache-airflow/stable/concepts/index.html) (>=v2.1)
+- Familiarity with [Apache Airflow](https://airflow.apache.org/docs/apache-airflow/stable/concepts/index.html) (`>=v2.1.0`)
 - [pipenv](https://pipenv-fork.readthedocs.io/en/latest/install.html#installing-pipenv) for creating similar Python environments via `Pipfile.lock`
 - [gcloud](https://cloud.google.com/sdk/gcloud) command-line tool with Google Cloud Platform credentials configured. Instructions can be found [here](https://cloud.google.com/sdk/docs/initializing).
 - [Terraform](https://learn.hashicorp.com/tutorials/terraform/install-cli) `>=v0.15.1`
-- [Google Cloud Composer](https://cloud.google.com/composer/docs/concepts/overview) environment running [Apache Airflow](https://airflow.apache.org/docs/apache-airflow/stable/concepts.html) `>=2.0`. To create a new Cloud Composer environment, see [this guide](https://cloud.google.com/composer/docs/how-to/managing/creating).
+- [Google Cloud Composer](https://cloud.google.com/composer/docs/concepts/overview) environment running [Apache Airflow](https://airflow.apache.org/docs/apache-airflow/stable/concepts.html) `>=2.1.0` and Cloud Composer `>=2.0.0`. To create a new Cloud Composer environment, see [this guide](https://cloud.google.com/composer/docs/how-to/managing/creating).
 
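The Python constraint in the requirements above can be checked programmatically. A small stdlib-only sketch (the bounds are taken from the README; the helper name is ours):

```python
import sys

# README constraint: Python >=3.6.10,<3.9 (the repo itself uses 3.8).
LOWER = (3, 6, 10)
UPPER = (3, 9, 0)

def satisfies_readme_constraint(version=None):
    """Return True if `version` — a (major, minor, micro) tuple, defaulting
    to the running interpreter — falls within the README's range."""
    if version is None:
        version = sys.version_info[:3]
    return LOWER <= tuple(version) < UPPER
```

For example, the repo's pinned `3.8` passes while `3.9.0` is rejected, since the upper bound is exclusive.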
# Environment Setup

@@ -36,24 +36,8 @@
     austin_bikeshare_stations_transform_csv = kubernetes_pod_operator.KubernetesPodOperator(
         task_id="austin_bikeshare_stations_transform_csv",
         name="bikeshare_stations",
-        namespace="default",
-        affinity={
-            "nodeAffinity": {
-                "requiredDuringSchedulingIgnoredDuringExecution": {
-                    "nodeSelectorTerms": [
-                        {
-                            "matchExpressions": [
-                                {
-                                    "key": "cloud.google.com/gke-nodepool",
-                                    "operator": "In",
-                                    "values": ["pool-e2-standard-4"],
-                                }
-                            ]
-                        }
-                    ]
-                }
-            }
-        },
+        namespace="composer",
+        service_account_name="datasets",
         image_pull_policy="Always",
         image="{{ var.json.austin_bikeshare.container_registry.run_csv_transform_kub }}",
         env_vars={
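The same substitution repeats in every `*_dag.py` this commit touches: the per-pod node affinity (which pinned tasks to the `pool-e2-standard-4` GKE node pool) and the `default` namespace are dropped in favor of the `composer` namespace plus a named Kubernetes service account. A sketch of the changed keyword arguments as plain dicts, since constructing a real `KubernetesPodOperator` requires an Airflow installation:

```python
# Keyword arguments removed by this commit. The affinity dict pinned each
# transform pod to the "pool-e2-standard-4" GKE node pool.
removed_kwargs = {
    "namespace": "default",
    "affinity": {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": "cloud.google.com/gke-nodepool",
                                "operator": "In",
                                "values": ["pool-e2-standard-4"],
                            }
                        ]
                    }
                ]
            }
        }
    },
}

# Keyword arguments added in their place: pods now run in the "composer"
# namespace under the "datasets" Kubernetes service account.
added_kwargs = {
    "namespace": "composer",
    "service_account_name": "datasets",
}
```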
15 changes: 2 additions & 13 deletions datasets/austin_bikeshare/bikeshare_stations/pipeline.yaml
@@ -49,19 +49,8 @@ dag:
 
         # The name of the pod in which the task will run. This will be used (plus a random suffix) to generate a pod id
         name: "bikeshare_stations"
-
-        # The namespace to run within Kubernetes. Always set its value to "default" because we follow the guideline that KubernetesPodOperator will only be used for very light workloads, i.e. use the Cloud Composer environment's resources without starving other pipelines.
-        namespace: "default"
-
-        affinity:
-          nodeAffinity:
-            requiredDuringSchedulingIgnoredDuringExecution:
-              nodeSelectorTerms:
-                - matchExpressions:
-                    - key: cloud.google.com/gke-nodepool
-                      operator: In
-                      values:
-                        - "pool-e2-standard-4"
+        namespace: "composer"
+        service_account_name: "datasets"
 
         image_pull_policy: "Always"
 
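For context on what the YAML above deletes: a `requiredDuringSchedulingIgnoredDuringExecution` node affinity admits a node only if its labels satisfy every match expression. A minimal Python sketch of the `In` operator used here (the helper name is hypothetical; real evaluation happens in the Kubernetes scheduler):

```python
def node_matches(node_labels, match_expressions):
    """Evaluate matchExpressions (operator "In" only, as used in the
    removed affinity) against a node's label dict. All must match."""
    for expr in match_expressions:
        if expr["operator"] != "In":
            raise NotImplementedError(expr["operator"])
        if node_labels.get(expr["key"]) not in expr["values"]:
            return False
    return True

# The single expression this commit removes from every pipeline:
expressions = [
    {
        "key": "cloud.google.com/gke-nodepool",
        "operator": "In",
        "values": ["pool-e2-standard-4"],
    }
]
```

Under this rule, only nodes labeled `cloud.google.com/gke-nodepool=pool-e2-standard-4` were eligible; dropping it lets the scheduler place pods on any node the `composer` namespace allows.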
2 changes: 0 additions & 2 deletions datasets/austin_bikeshare/dataset.yaml
@@ -31,8 +31,6 @@ dataset:
   # A list of terms and conditions that users of the dataset should agree on,
   # using the YAML list syntax.
   terms_of_use: ~
-
-
   resources:
   # A list of Google Cloud resources needed by your dataset. In principle, all
   # pipelines under a dataset should be able to share these resources.
20 changes: 2 additions & 18 deletions datasets/austin_crime/crime/crime_dag.py
@@ -36,24 +36,8 @@
     austin_crime_transform_csv = kubernetes_pod_operator.KubernetesPodOperator(
         task_id="austin_crime_transform_csv",
         name="crime",
-        namespace="default",
-        affinity={
-            "nodeAffinity": {
-                "requiredDuringSchedulingIgnoredDuringExecution": {
-                    "nodeSelectorTerms": [
-                        {
-                            "matchExpressions": [
-                                {
-                                    "key": "cloud.google.com/gke-nodepool",
-                                    "operator": "In",
-                                    "values": ["pool-e2-standard-4"],
-                                }
-                            ]
-                        }
-                    ]
-                }
-            }
-        },
+        namespace="composer",
+        service_account_name="datasets",
         image_pull_policy="Always",
         image="{{ var.json.austin_crime.container_registry.run_csv_transform_kub }}",
         env_vars={
13 changes: 3 additions & 10 deletions datasets/austin_crime/crime/pipeline.yaml
@@ -38,16 +38,9 @@ dag:
       args:
         task_id: "austin_crime_transform_csv"
         name: "crime"
-        namespace: "default"
-        affinity:
-          nodeAffinity:
-            requiredDuringSchedulingIgnoredDuringExecution:
-              nodeSelectorTerms:
-                - matchExpressions:
-                    - key: cloud.google.com/gke-nodepool
-                      operator: In
-                      values:
-                        - "pool-e2-standard-4"
+        namespace: "composer"
+        service_account_name: "datasets"
 
         image_pull_policy: "Always"
         image: "{{ var.json.austin_crime.container_registry.run_csv_transform_kub }}"
         env_vars:
2 changes: 0 additions & 2 deletions datasets/austin_crime/dataset.yaml
@@ -19,8 +19,6 @@ dataset:
   description: Austin Crime dataset
   dataset_sources: ~
   terms_of_use: ~
-
-
   resources:
     - type: bigquery_dataset
      dataset_id: austin_crime
2 changes: 0 additions & 2 deletions datasets/austin_waste/dataset.yaml
@@ -18,8 +18,6 @@ dataset:
   description: austin-based datasets
   dataset_sources: ~
   terms_of_use: ~
-
-
   resources:
     - type: bigquery_dataset
      dataset_id: austin_waste
3 changes: 2 additions & 1 deletion datasets/austin_waste/waste_and_diversion/pipeline.yaml
@@ -38,7 +38,8 @@ dag:
         task_id: "austin_waste_transform_csv"
         startup_timeout_seconds: 600
         name: "austin_waste"
-        namespace: "default"
+        namespace: "composer"
+        service_account_name: "datasets"
         image_pull_policy: "Always"
         image: "{{ var.json.austin_waste.container_registry.run_csv_transform_kub }}"
         env_vars:
@@ -37,7 +37,8 @@
         task_id="austin_waste_transform_csv",
         startup_timeout_seconds=600,
         name="austin_waste",
-        namespace="default",
+        namespace="composer",
+        service_account_name="datasets",
         image_pull_policy="Always",
         image="{{ var.json.austin_waste.container_registry.run_csv_transform_kub }}",
         env_vars={
5 changes: 5 additions & 0 deletions datasets/bls/_terraform/bls_dataset.tf
@@ -28,6 +28,11 @@ output "bigquery_dataset-bls-dataset_id" {
 resource "google_storage_bucket" "bls" {
   name          = "${var.bucket_name_prefix}-bls"
   force_destroy = true
+  lifecycle {
+    ignore_changes = [
+      logging,
+    ]
+  }
 }
 
 output "storage_bucket-bls-name" {
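On the Terraform side, the added `lifecycle` block tells Terraform to ignore drift in the bucket's `logging` argument, so a logging configuration applied outside Terraform (by an administrator or another tool) is not reverted on the next `terraform apply`. Annotated, the pattern looks like this (a sketch of the block from the diff above, not additional configuration):

```hcl
resource "google_storage_bucket" "bls" {
  name          = "${var.bucket_name_prefix}-bls"
  force_destroy = true

  lifecycle {
    # Diffs in these arguments are ignored during planning, so
    # out-of-band changes to access logging are left in place.
    ignore_changes = [
      logging,
    ]
  }
}
```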
20 changes: 2 additions & 18 deletions datasets/bls/c_cpi_u/c_cpi_u_dag.py
@@ -37,24 +37,8 @@
         task_id="transform_csv",
         startup_timeout_seconds=600,
         name="c_cpi_u",
-        namespace="default",
-        affinity={
-            "nodeAffinity": {
-                "requiredDuringSchedulingIgnoredDuringExecution": {
-                    "nodeSelectorTerms": [
-                        {
-                            "matchExpressions": [
-                                {
-                                    "key": "cloud.google.com/gke-nodepool",
-                                    "operator": "In",
-                                    "values": ["pool-e2-standard-4"],
-                                }
-                            ]
-                        }
-                    ]
-                }
-            }
-        },
+        namespace="composer",
+        service_account_name="datasets",
         image_pull_policy="Always",
         image="{{ var.json.bls.container_registry.run_csv_transform_kub }}",
         env_vars={
13 changes: 3 additions & 10 deletions datasets/bls/c_cpi_u/pipeline.yaml
@@ -41,16 +41,9 @@ dag:
         task_id: "transform_csv"
         startup_timeout_seconds: 600
         name: "c_cpi_u"
-        namespace: "default"
-        affinity:
-          nodeAffinity:
-            requiredDuringSchedulingIgnoredDuringExecution:
-              nodeSelectorTerms:
-                - matchExpressions:
-                    - key: cloud.google.com/gke-nodepool
-                      operator: In
-                      values:
-                        - "pool-e2-standard-4"
+        namespace: "composer"
+        service_account_name: "datasets"
 
         image_pull_policy: "Always"
         image: "{{ var.json.bls.container_registry.run_csv_transform_kub }}"
         env_vars:
20 changes: 2 additions & 18 deletions datasets/bls/cpi_u/cpi_u_dag.py
@@ -37,24 +37,8 @@
         task_id="transform_csv",
         startup_timeout_seconds=600,
         name="cpi_u",
-        namespace="default",
-        affinity={
-            "nodeAffinity": {
-                "requiredDuringSchedulingIgnoredDuringExecution": {
-                    "nodeSelectorTerms": [
-                        {
-                            "matchExpressions": [
-                                {
-                                    "key": "cloud.google.com/gke-nodepool",
-                                    "operator": "In",
-                                    "values": ["pool-e2-standard-4"],
-                                }
-                            ]
-                        }
-                    ]
-                }
-            }
-        },
+        namespace="composer",
+        service_account_name="datasets",
         image_pull_policy="Always",
         image="{{ var.json.bls.container_registry.run_csv_transform_kub }}",
         env_vars={
13 changes: 3 additions & 10 deletions datasets/bls/cpi_u/pipeline.yaml
@@ -39,16 +39,9 @@ dag:
         task_id: "transform_csv"
         startup_timeout_seconds: 600
         name: "cpi_u"
-        namespace: "default"
-        affinity:
-          nodeAffinity:
-            requiredDuringSchedulingIgnoredDuringExecution:
-              nodeSelectorTerms:
-                - matchExpressions:
-                    - key: cloud.google.com/gke-nodepool
-                      operator: In
-                      values:
-                        - "pool-e2-standard-4"
+        namespace: "composer"
+        service_account_name: "datasets"
 
         image_pull_policy: "Always"
         image: "{{ var.json.bls.container_registry.run_csv_transform_kub }}"
         env_vars:
2 changes: 0 additions & 2 deletions datasets/bls/dataset.yaml
@@ -18,8 +18,6 @@ dataset:
   description: ~
   dataset_sources: ~
   terms_of_use: ~
-
-
   resources:
     - type: bigquery_dataset
      dataset_id: bls
@@ -37,24 +37,8 @@
         task_id="transform_csv",
         startup_timeout_seconds=600,
         name="employment_hours_earnings",
-        namespace="default",
-        affinity={
-            "nodeAffinity": {
-                "requiredDuringSchedulingIgnoredDuringExecution": {
-                    "nodeSelectorTerms": [
-                        {
-                            "matchExpressions": [
-                                {
-                                    "key": "cloud.google.com/gke-nodepool",
-                                    "operator": "In",
-                                    "values": ["pool-e2-standard-4"],
-                                }
-                            ]
-                        }
-                    ]
-                }
-            }
-        },
+        namespace="composer",
+        service_account_name="datasets",
         image_pull_policy="Always",
         image="{{ var.json.bls.container_registry.run_csv_transform_kub }}",
         env_vars={
13 changes: 3 additions & 10 deletions datasets/bls/employment_hours_earnings/pipeline.yaml
@@ -39,16 +39,9 @@ dag:
         task_id: "transform_csv"
         startup_timeout_seconds: 600
         name: "employment_hours_earnings"
-        namespace: "default"
-        affinity:
-          nodeAffinity:
-            requiredDuringSchedulingIgnoredDuringExecution:
-              nodeSelectorTerms:
-                - matchExpressions:
-                    - key: cloud.google.com/gke-nodepool
-                      operator: In
-                      values:
-                        - "pool-e2-standard-4"
+        namespace: "composer"
+        service_account_name: "datasets"
 
         image_pull_policy: "Always"
         image: "{{ var.json.bls.container_registry.run_csv_transform_kub }}"
         env_vars:
@@ -37,24 +37,8 @@
         task_id="transform_csv",
         startup_timeout_seconds=600,
         name="employment_hours_earnings_series",
-        namespace="default",
-        affinity={
-            "nodeAffinity": {
-                "requiredDuringSchedulingIgnoredDuringExecution": {
-                    "nodeSelectorTerms": [
-                        {
-                            "matchExpressions": [
-                                {
-                                    "key": "cloud.google.com/gke-nodepool",
-                                    "operator": "In",
-                                    "values": ["pool-e2-standard-4"],
-                                }
-                            ]
-                        }
-                    ]
-                }
-            }
-        },
+        namespace="composer",
+        service_account_name="datasets",
         image_pull_policy="Always",
         image="{{ var.json.bls.container_registry.run_csv_transform_kub }}",
         env_vars={
13 changes: 3 additions & 10 deletions datasets/bls/employment_hours_earnings_series/pipeline.yaml
@@ -39,16 +39,9 @@ dag:
         task_id: "transform_csv"
         startup_timeout_seconds: 600
         name: "employment_hours_earnings_series"
-        namespace: "default"
-        affinity:
-          nodeAffinity:
-            requiredDuringSchedulingIgnoredDuringExecution:
-              nodeSelectorTerms:
-                - matchExpressions:
-                    - key: cloud.google.com/gke-nodepool
-                      operator: In
-                      values:
-                        - "pool-e2-standard-4"
+        namespace: "composer"
+        service_account_name: "datasets"
 
         image_pull_policy: "Always"
         image: "{{ var.json.bls.container_registry.run_csv_transform_kub }}"
         env_vars:
13 changes: 3 additions & 10 deletions datasets/bls/unemployment_cps/pipeline.yaml
@@ -39,16 +39,9 @@ dag:
         task_id: "transform_csv"
         startup_timeout_seconds: 600
         name: "unemployment_cps"
-        namespace: "default"
-        affinity:
-          nodeAffinity:
-            requiredDuringSchedulingIgnoredDuringExecution:
-              nodeSelectorTerms:
-                - matchExpressions:
-                    - key: cloud.google.com/gke-nodepool
-                      operator: In
-                      values:
-                        - "pool-e2-standard-4"
+        namespace: "composer"
+        service_account_name: "datasets"
 
         image_pull_policy: "Always"
         image: "{{ var.json.bls.container_registry.run_csv_transform_kub }}"
         env_vars:
