# Google Cloud Dataflow

Table of contents:

- [Beam Programming](#beam-programming)
- [Building a Simple Data Transformation Pipeline](#building-a-simple-data-transformation-pipeline)


Google Cloud Dataflow provides serverless, parallel, and distributed infrastructure for running batch and stream data processing jobs.
One of the core strengths of Dataflow is its ability to switch almost seamlessly from processing historical batch data to processing streaming datasets, while handling streaming-specific concerns such as windowing.
Dataflow is a major component for building an end-to-end ML production pipeline on GCP.
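
Windowing groups the elements of an unbounded stream into finite chunks based on their timestamps. As a rough illustration in plain Python (this is not the Beam API; `fixed_window_start` and `window_counts` are hypothetical helper names, and timestamps are assumed to be integer seconds), fixed-size windows can be sketched as:

```python
from collections import defaultdict


def fixed_window_start(timestamp, size):
    """Return the start of the fixed window of width `size` containing `timestamp`."""
    return timestamp - (timestamp % size)


def window_counts(events, size):
    """Count events per fixed window; `events` is a list of (timestamp, payload) pairs."""
    counts = defaultdict(int)
    for timestamp, _payload in events:
        counts[fixed_window_start(timestamp, size)] += 1
    return dict(counts)


# Hypothetical sample events: (timestamp in seconds, payload).
events = [(3, "click"), (42, "view"), (61, "click"), (119, "view")]

# Group the events into 60-second windows keyed by window start time.
per_minute = window_counts(events, 60)
print(per_minute)
```

A bounded dataset can be treated as a single window covering all of its elements, which is one way to see why the same transformation logic can serve both batch and streaming inputs.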

<a id="beam-programming"></a>

## Beam Programming
Dataflow runs pipelines written with the Apache Beam programming model. Apache Beam provides a small set of core concepts that simplify building transformation pipelines for distributed batch and stream jobs.

- **A Pipeline:** A Pipeline object wraps the entire operation and prescribes the transformation process by defining the input data source to the pipeline, how that data will be transformed, and where the results will be written.
- **A PCollection:** A PCollection represents the data that the pipeline operates on, typically created by reading from a data source. The data can be either bounded or unbounded. Bounded data refers to batch or historical data, whereas unbounded data refers to streaming data.
- **A PTransform:** A PTransform is a transformation step carried out on one or more PCollections in the pipeline. The core Beam transforms include:
  - ParDo: for parallel, element-wise processing.
  - GroupByKey: for processing collections of key/value pairs.
  - CoGroupByKey: for a relational join of two or more key/value PCollections with the same key type.
  - Combine: for combining collections of elements or values in your data.
  - Flatten: for merging multiple PCollection objects.
  - Partition: for splitting a single PCollection into smaller collections.
- **I/O Transforms:** These are PTransforms that read data from or write data to different external storage systems.
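
To make the behaviour of the core transforms concrete, the sketch below mimics what each one computes using plain Python lists and dictionaries. It is only an analogue of their semantics under simplifying assumptions (in-memory data, a single machine), not the Apache Beam API: every function name here is a hypothetical helper, and real Beam transforms are applied to PCollections and executed in parallel by a runner.

```python
from collections import defaultdict
from itertools import chain

# Plain-Python stand-ins for what the core transforms compute.
# Hypothetical helpers for illustration only; these are not Beam functions.


def par_do(fn, elements):
    """ParDo: apply fn to every element; fn may emit zero or more outputs."""
    return [out for element in elements for out in fn(element)]


def group_by_key(pairs):
    """GroupByKey: collect all values that share a key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def co_group_by_key(**tagged):
    """CoGroupByKey: join several keyed collections on their shared key type."""
    keys = {key for pairs in tagged.values() for key, _ in pairs}
    joined = {key: {tag: [] for tag in tagged} for key in keys}
    for tag, pairs in tagged.items():
        for key, value in pairs:
            joined[key][tag].append(value)
    return joined


def combine_per_key(combine_fn, pairs):
    """Combine: reduce the values of each key to a single value."""
    return {key: combine_fn(values) for key, values in group_by_key(pairs).items()}


def flatten(*collections):
    """Flatten: merge several collections into one."""
    return list(chain(*collections))


def partition(partition_fn, n, elements):
    """Partition: split one collection into n smaller collections."""
    parts = [[] for _ in range(n)]
    for element in elements:
        parts[partition_fn(element, n)].append(element)
    return parts


# A tiny word count built from the helpers above.
lines = ["a b", "b c b"]
words = par_do(lambda line: line.split(), lines)
counts = combine_per_key(sum, [(word, 1) for word in words])
print(counts)
```

The word count at the end chains a ParDo-style step (splitting lines into words) with a per-key Combine, which is the same shape a linear Beam pipeline with sequential transforms takes.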

<div style="display: inline-block;width: 100%;">
<img src="ieee-ompi/dataflow-sequential-transform.png" style="float:left;" alt="A Simple Linear Pipeline with Sequential Transforms." height=90% width=90% />
</div>

<a id="building-a-simple-data-transformation-pipeline"></a>

## Building a Simple Data Transformation Pipeline

In [3]:
# authenticate GCP account
!gcloud auth application-default login

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&prompt=select_account&response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform&access_type=offline



Credentials saved to file: [/Users/ekababisong/.config/gcloud/application_default_credentials.json]

These credentials will be used by any library that requests
Application Default Credentials.

To generate an access token for other uses, run:
  gcloud auth application-default print-access-token


In [5]:
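# install the Apache Beam library and other required setup packages.
# note: each `!` command runs in its own subshell, so the environment
# activation on the next line does not carry over to the commands after it.
!source activate py2env
!pip uninstall -y google-cloud-dataflow
!conda install -y pytz==2018.4
# quote the package spec so the shell does not interpret the square brackets
!pip install 'apache-beam[gcp]'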

/bin/sh: activate: No such file or directory
Skipping google-cloud-dataflow as it is not installed.
Collecting package metadata: done
Solving environment: done


  current version: 4.6.1
  latest version: 4.6.2

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /Users/ekababisong/anaconda3/envs/pydl

  added / updated specs:
    - pytz==2018.4


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    pytz-2018.4                |           py36_0         212 KB
    ------------------------------------------------------------
                                           Total:         212 KB

The following NEW packages will be INSTALLED:

  pip                pkgs/main/osx-64::pip-18.1-py36_0

The following packages will be UPDATED:

  certifi                                  2018.8.24-py35_1 --> 2018.11.29-py36_0
  o

twine 1.12.1 requires pkginfo>=1.4.2, which is not installed.
twine 1.12.1 requires requests-toolbelt>=0.8.0, which is not installed.
twine 1.12.1 requires tqdm>=4.14, which is not installed.
readme-renderer 24.0 requires bleach>=2.1.0, which is not installed.
readme-renderer 24.0 requires docutils>=0.13.1, which is not installed.
readme-renderer 24.0 requires Pygments, which is not installed.
tf-nightly 1.13.0.dev20190126 has requirement protobuf>=3.6.1, but you'll have protobuf 3.3.0 which is incompatible.
tb-nightly 1.13.0a20190126 has requirement protobuf>=3.4.0, but you'll have protobuf 3.3.0 which is incompatible.
googleapis-common-protos 1.5.6 has requirement protobuf>=3.6.0, but you'll have protobuf 3.3.0 which is incompatible.
Installing collected packages: avro, crcmod, dill, httplib2, six, pbr, mock, pyasn1-modules, rsa, oauth2client, protobuf, pyyaml, typing, google-apitools, googleapis-common-

In [6]:
# import relevant libraries
import apache_beam as beam
from apache_beam.io import ReadFromText
from apache_beam.io import WriteToText

ImportError: No module named 'apache_beam'