Dataflow Python builder

Introduction

Google Cloud Dataflow, based on Apache Beam, is a fully-managed service for transforming and enriching data in stream (real time) and batch (historical) modes with equal reliability and expressiveness. Developers and Data Scientists use Dataflow to process large amounts of data without managing complex cluster infrastructure.

Google Cloud Build offers a number of advantages for Cloud Dataflow developers:

  • Small workloads that run on an n1-standard-1 virtual machine can take advantage of the free tier, which provides 120 free build-minutes per day
  • Workflows start very quickly, typically within a few seconds (depending on the size of your container)
  • Pipelines get all the benefits of containerization, including a consistent environment and integration with your CI/CD flow
  • Cloud Build supports automatic triggering from GitHub, Bitbucket and Google Cloud Source Repositories, so you can configure your data warehouse to update automatically when the pipeline code changes
  • Pipelines can be initiated by a simple REST API call, as sketched below
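
For example, a one-off build can be submitted straight to the Cloud Build REST API. The following is only a minimal sketch: my-project, pipeline.py and the step arguments are placeholders, not part of this builder.

# Hedged sketch: trigger a build through the Cloud Build REST API
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"steps":[{"name":"gcr.io/my-project/dataflow-python","args":["pipeline.py","--runner=DirectRunner"]}]}' \
  "https://cloudbuild.googleapis.com/v1/projects/my-project/builds"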

The builder supports both Dataflow execution modes:

  • DirectRunner runs your code in-process inside Cloud Build, taking advantage of the fast start and free tier pricing
  • DataflowRunner starts workers on Compute Engine, allowing for massive scalability (see the sketch below)
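
In Beam's Python SDK, the runner is chosen with the standard --runner pipeline option. A minimal sketch, where pipeline.py and the bucket name are placeholders:

# Run in-process inside Cloud Build (fast start, free tier)
python pipeline.py --runner=DirectRunner

# Launch workers on Compute Engine for large jobs
python pipeline.py --runner=DataflowRunner \
    --project=my-project \
    --temp_location=gs://my-bucket/tmp/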

This builder supports the Cloud Dataflow Python API.

Usage

If this is your first time using Cloud Build, follow the Quickstart for Docker to get started.

Then, clone this code and build the builder:

gcloud builds submit --config=cloudbuild.yaml .
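
A successful build pushes the builder image to your project's Container Registry. You can confirm it is available with something like the following (my-project is a placeholder):

# The dataflow-python image should appear in the listing
gcloud container images list --repository=gcr.io/my-project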

To access resources on Google Cloud Platform from your pipeline (whether Cloud Storage, BigQuery datasets, or Dataflow runners), issue the following commands to grant the necessary roles to your Cloud Build service account:

# Set up IAM bindings (add-iam-policy-binding accepts one --role per invocation)
export PROJECT=$(gcloud info --format='value(config.project)')
export PROJECT_NUMBER=$(gcloud projects describe $PROJECT --format='value(projectNumber)')
export CB_SA_EMAIL=$PROJECT_NUMBER@cloudbuild.gserviceaccount.com
gcloud projects add-iam-policy-binding $PROJECT --member=serviceAccount:$CB_SA_EMAIL --role='roles/iam.serviceAccountUser'
gcloud projects add-iam-policy-binding $PROJECT --member=serviceAccount:$CB_SA_EMAIL --role='roles/iam.serviceAccountActor'
gcloud projects add-iam-policy-binding $PROJECT --member=serviceAccount:$CB_SA_EMAIL --role='roles/dataflow.admin'
# Enable the Dataflow API
gcloud services enable dataflow.googleapis.com
# Create a Cloud Storage staging bucket
gsutil mb gs://cloudbuild-dataflow-$PROJECT
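
To double-check that the bindings took effect, you can inspect the project's IAM policy filtered to the Cloud Build service account; a quick sketch:

# Show the roles granted to the Cloud Build service account
gcloud projects get-iam-policy $PROJECT \
    --flatten='bindings[].members' \
    --filter="bindings.members:$CB_SA_EMAIL" \
    --format='table(bindings.role)'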

Python notes

Python has several different dependency management tools, which interact with containers in different ways. In this case we use virtualenv to set up an isolated folder inside the container with the libraries we need. Because of this, be sure that your first build step loads the virtualenv environment, as sketched below. See examples for details.
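
Concretely, whatever command your first build step runs should activate the virtualenv and execute the pipeline in the same shell, since an activation done in a separate step or shell is lost. A minimal sketch, assuming the virtualenv lives at venv/ inside the container and pipeline.py is a placeholder:

# Activate the virtualenv and run the pipeline in one shell
/bin/bash -c "source venv/bin/activate && python pipeline.py --runner=DirectRunner"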

Additional libraries can be added by creating another container based on this one, for example:

FROM gcr.io/my-project/dataflow-python

# Each RUN starts a fresh shell, so activate the virtualenv and install
# within the same step; otherwise pip would install outside the virtualenv.
RUN /bin/bash -c "source venv/bin/activate && pip install my-library"
...
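
The derived image can then be built and pushed like any other custom builder, for example (the tag is a placeholder):

gcloud builds submit --tag gcr.io/my-project/dataflow-python-custom .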

This container uses Python 2; Python 3 support is currently under development (see BEAM-1251).

Examples

For examples, see the examples directory.
