In [1]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# E2E ML on GCP: MLOps stage 1 : data management: get started with Dataflow

<table align="left">
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/ml_ops/stage1/get_started_dataflow.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/ai/platform/notebooks/deploy-notebook?download_url=https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/ml_ops/stage1/get_started_dataflow.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
Open in Vertex AI Workbench
    </a>
  </td>
</table>
<br/><br/><br/>

## Overview


This tutorial demonstrates how to use Vertex AI for end-to-end (E2E) MLOps on Google Cloud in production. It covers stage 1, data management: getting started with Dataflow.

### Dataset

The dataset used for this tutorial is the GSOD dataset from [BigQuery public datasets](https://cloud.google.com/bigquery/public-data). The version of the dataset you use here will only use the fields year, month and day to predict the value of mean daily temperature (mean_temp).

### Objective

In this tutorial, you learn how to use `Dataflow` to preprocess data for training with `Vertex AI`.

This tutorial uses the following Google Cloud ML services:

- `Dataflow`
- `BigQuery Datasets`

The steps performed include:

- Offline preprocessing of data:
    - Serially, without Dataflow
    - In parallel, with Dataflow
- Upstream preprocessing of data:
    - Tabular data
    - Image data

### Recommendations

When doing E2E MLOps on Google Cloud, the following best practices are recommended for preprocessing and feeding data during training of custom models:

#### Preprocessing

Data is preprocessed either:

- Offline: The data is preprocessed and stored prior to training.
    - Small datasets: reprocessed and stored again when new data arrives.
- Upstream: The data is preprocessed upstream from the model while the data is fed for training.
    - Training on a CPU.
- Downstream: The data is preprocessed downstream in the model while the data is fed for training.
    - Training on a hardware accelerator (e.g., GPU/TPU).

#### Model Feeding

Data is fed to the model either:

- In-memory: small dataset.
- From disk: large dataset, quick training.
- `Dataflow` from disk: massive dataset, extended training.

#### AutoML

For AutoML training, preprocessing and model feeding are automatically handled.

Alternatively, for AutoML tabular model training, you can reconfigure the default preprocessing.

## Installations

Install *one time* the packages for executing the MLOps notebooks.

In [2]:
ONCE_ONLY = False
if ONCE_ONLY:
    ! pip3 install -U tensorflow==2.5 $USER_FLAG
    ! pip3 install -U tensorflow-data-validation==1.2 $USER_FLAG
    ! pip3 install -U tensorflow-transform==1.2 $USER_FLAG
    ! pip3 install -U tensorflow-io==0.18 $USER_FLAG
    ! pip3 install --upgrade google-cloud-aiplatform[tensorboard] $USER_FLAG
    ! pip3 install --upgrade google-cloud-pipeline-components $USER_FLAG
    ! pip3 install --upgrade google-cloud-bigquery $USER_FLAG
    ! pip3 install --upgrade google-cloud-logging $USER_FLAG
    ! pip3 install --upgrade apache-beam[gcp] $USER_FLAG
    ! pip3 install --upgrade pyarrow $USER_FLAG
    ! pip3 install --upgrade cloudml-hypertune $USER_FLAG
    ! pip3 install --upgrade kfp $USER_FLAG
    ! pip3 install future $USER_FLAG

Collecting tensorflow==2.5
  Using cached tensorflow-2.5.0-cp37-cp37m-manylinux2010_x86_64.whl (454.3 MB)
Collecting grpcio~=1.34.0
  Using cached grpcio-1.34.1-cp37-cp37m-manylinux2014_x86_64.whl (4.0 MB)
Installing collected packages: grpcio, tensorflow
  Attempting uninstall: grpcio
    Found existing installation: grpcio 1.44.0
    Uninstalling grpcio-1.44.0:
      Successfully uninstalled grpcio-1.44.0
  Attempting uninstall: tensorflow
    Found existing installation: tensorflow 2.5.3
    Uninstalling tensorflow-2.5.3:
      Successfully uninstalled tensorflow-2.5.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tfx-bsl 1.2.0 requires google-cloud-bigquery<2.21,>=1.28.0, but you have google-cloud-bigquery 2.34.2 which is incompatible.
tfx-bsl 1.2.0 requires pyarrow<3,>=1, but you have pyarrow 7.0.0 which is incompatible.
tensorfl

### Restart the kernel

Once you've installed the additional packages, you need to restart the notebook kernel so it can find the packages.

In [3]:
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud`.

In [1]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

In [2]:
if PROJECT_ID == "" or PROJECT_ID is None or PROJECT_ID == "[your-project-id]":
    # Get your GCP project id from gcloud
    shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID:", PROJECT_ID)

Project ID: vertex-ai-dev


In [3]:
! gcloud config set project $PROJECT_ID

Updated property [core/project].


#### Region

You can also change the `REGION` variable, which is used for operations
throughout the rest of this notebook.  Below are regions supported for Vertex AI. We recommend that you choose the region closest to you.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You may not use a multi-regional bucket for training with Vertex AI. Not all regions provide support for all Vertex AI services.

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [4]:
REGION = "us-central1"  # @param {type: "string"}

#### Timestamp

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on resources created, you create a timestamp for each instance session, and append the timestamp onto the name of resources you create in this tutorial.

In [5]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

In this tutorial, you use a Cloud Storage bucket to store the data schema,
the temporary and staging files for the Dataflow job, and the TFRecord
files that the job exports.

Set the name of your Cloud Storage bucket below. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.

In [6]:
BUCKET_NAME = "gs://[your-bucket-name]"  # @param {type:"string"}

In [7]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "gs://[your-bucket-name]":
    BUCKET_NAME = "gs://" + PROJECT_ID + "aip-" + TIMESTAMP
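
Because the default name above is built from the project ID and a timestamp, it can be worth checking it against the Cloud Storage naming rules before creating the bucket. The following is a minimal sketch and not part of the original tutorial: `default_bucket_name` and `is_valid_bucket_name` are hypothetical helpers, `"my-project"` is a placeholder, and the check covers only the basic rules (3 to 63 characters; lowercase letters, digits, dashes, and underscores; starting and ending with a letter or digit). It deliberately ignores the special cases for dotted names and reserved prefixes.

```python
import re
from datetime import datetime

# Basic Cloud Storage naming rules only (assumption: no dotted names,
# no reserved-prefix checks).
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]")


def default_bucket_name(project_id: str, timestamp: str) -> str:
    """Mirror the tutorial's default: gs://<project>aip-<timestamp>."""
    return "gs://" + project_id + "aip-" + timestamp


def is_valid_bucket_name(bucket_uri: str) -> bool:
    """Return True if the name after gs:// passes the basic rules."""
    prefix = "gs://"
    if not bucket_uri.startswith(prefix):
        return False
    return _BUCKET_RE.fullmatch(bucket_uri[len(prefix):]) is not None


timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
name = default_bucket_name("my-project", timestamp)  # placeholder project ID
print(name, is_valid_bucket_name(name))
```

A check like this catches the unedited `gs://[your-bucket-name]` placeholder as well as project IDs that contain characters a bucket name cannot hold.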

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [8]:
! gsutil mb -l $REGION $BUCKET_NAME

Creating gs://vertex-ai-devaip-20220308120540/...


Finally, validate access to your Cloud Storage bucket by examining its contents:

In [9]:
! gsutil ls -al $BUCKET_NAME

### Set up variables

Next, set up some variables used throughout the tutorial.

### Import libraries and define constants

In [10]:
import google.cloud.aiplatform as aip

#### Import Apache Beam

Import the Apache Beam package into your Python environment.

In [11]:
import apache_beam as beam

INFO:apache_beam.typehints.native_type_compatibility:Using Any for unsupported type: typing.Sequence[~T]


#### Import BigQuery

Import the BigQuery package into your Python environment.

In [12]:
from google.cloud import bigquery

#### Import pandas

Import the pandas package into your Python environment.

In [13]:
import pandas as pd

#### Import numpy

Import the numpy package into your Python environment.

In [14]:
import numpy as np

#### Import TensorFlow Data Validation

Import the TensorFlow Data Validation (TFDV) package into your Python environment.

In [15]:
import tensorflow_data_validation as tfdv

2022-03-08 12:05:47.126988: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-03-08 12:05:47.127040: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


#### Import TensorFlow Transform

Import the TensorFlow Transform (TFT) package into your Python environment.

In [16]:
import tensorflow_transform as tft

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and region.

In [17]:
aip.init(project=PROJECT_ID, location=REGION)

### Create BigQuery client

Create the BigQuery client.

In [18]:
bqclient = bigquery.Client()

## Offline preprocessing data with BigQuery table using pandas dataframe

- Offline: The BigQuery table is preprocessed in-memory and stored prior to training.

    - Extract the tabular data into a pandas dataframe.
    - Preprocess the data, per column, within the dataframe.
    - Write the preprocessed dataframe to a new BigQuery table.

In [19]:
IMPORT_FILE = "bq://bigquery-public-data.samples.gsod"
BQ_TABLE = "bigquery-public-data.samples.gsod"

### Read the BigQuery dataset into a pandas dataframe

Next, you read a sample of the dataset into a pandas dataframe using the BigQuery `list_rows()` and `to_dataframe()` methods, as follows:

- `list_rows()`: Lists the rows of the specified table and returns a row iterator. Optionally specify:
    - `selected_fields`: Subset of fields (columns) to return.
    - `max_results`: The maximum number of rows to return, similar to the SQL `LIMIT` clause.

- `rows.to_dataframe()`: Iterates over the rows and reads the data into a pandas dataframe.

Learn more about [Loading BigQuery table into a dataframe](https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas).

In [20]:
# Download a table.
table = bigquery.TableReference.from_string("bigquery-public-data.samples.gsod")

rows = bqclient.list_rows(
    table,
    max_results=500,
    selected_fields=[
        bigquery.SchemaField("station_number", "STRING"),
        bigquery.SchemaField("year", "INTEGER"),
        bigquery.SchemaField("month", "INTEGER"),
        bigquery.SchemaField("day", "INTEGER"),
        bigquery.SchemaField("mean_temp", "FLOAT"),
    ],
)

dataframe = rows.to_dataframe()
print(dataframe.head())

  station_number  year  month  day  mean_temp
0          39730  1929     10   20  52.799999
1          33110  1929     12   18  47.500000
2          37770  1931      4   24  50.200001
3         726810  1931      6   23  65.099998
4         726810  1931      3    2  42.799999


### Transform data within pandas dataframe

Next, you preprocess the data within the dataframe. In this example, you convert the `station_number` column from strings to numeric values.

In [21]:
dataframe["station_number"] = pd.to_numeric(dataframe["station_number"])

### Create BQ dataset resource

First, you create an empty dataset resource in your project.

In [22]:
BQ_MY_DATASET = 'samples'
BQ_MY_TABLE = 'gsod'
! bq --location=US mk -d \
$PROJECT_ID:$BQ_MY_DATASET

BigQuery error in mk operation: Dataset 'vertex-ai-dev:samples' already exists.


In [23]:
job_config = bigquery.LoadJobConfig(
    # Specify a (partial) schema. All columns are always written to the
    # table. The schema is used to assist in data type definitions.
    schema=[
        bigquery.SchemaField("station_number", "FLOAT"),  # <-- after conversion with pd.to_numeric
        bigquery.SchemaField("year", "INTEGER"),
        bigquery.SchemaField("month", "INTEGER"),
        bigquery.SchemaField("day", "INTEGER"),
        bigquery.SchemaField("mean_temp", "FLOAT"),
    ],
    # Optionally, set the write disposition. BigQuery appends loaded rows
    # to an existing table by default, but with WRITE_TRUNCATE write
    # disposition it replaces the table with the loaded data.
    write_disposition="WRITE_TRUNCATE",
)

NEW_BQ_TABLE = f"{PROJECT_ID}.{BQ_MY_DATASET}.gsod_transformed"

job = bqclient.load_table_from_dataframe(
    dataframe, NEW_BQ_TABLE, job_config=job_config
)  # Make an API request.
job.result()  # Wait for the job to complete.

table = bqclient.get_table(NEW_BQ_TABLE)  # Make an API request.
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), NEW_BQ_TABLE
    )
)

Loaded 500 rows and 5 columns to vertex-ai-dev.samples.gsod_transformed


## Upstream preprocessing data with tf.data.Dataset generator

### Image data

- Upstream: The data is preprocessed upstream from the model while the data is fed for training.

    - Define preprocessing function:
        - Input: unprocessed batch of tensors
        - Output: preprocessed batch of tensors
    - Use tf.data.Dataset `map()` method to map the preprocessing function to the generator output.

In this example:

- Load CIFAR10 dataset into memory as numpy arrays.
- Create a tf.data.Dataset generator for the in-memory CIFAR10 dataset. *Note*: The pixel data is cast to `float32` to be compatible with the preprocessing function, which outputs the pixel data as `float32`.
- Define a preprocessing function to rescale the pixel data by 1/255.0
- Map the preprocessing function to the generator.

In [24]:
import tensorflow as tf
from tensorflow.keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

tf_dataset = tf.data.Dataset.from_tensor_slices((x_train.astype(np.float32), y_train))

print("Before preprocessing")
for batch in tf_dataset:
    print(batch)
    break


def preprocess_fn(inputs, labels):
    inputs /= 255.0
    return tf.cast(inputs, tf.float32), labels


tf_dataset = tf_dataset.map(preprocess_fn)

print("After preprocessing")
for batch in tf_dataset:
    print(batch)
    break

2022-03-08 12:06:45.931376: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-03-08 12:06:45.931423: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2022-03-08 12:06:45.931445: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (vertex-ai-sdk-manuel-2): /proc/driver/nvidia/version does not exist
2022-03-08 12:06:45.931907: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-03-08 12:06:46.434283: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 6144

Before preprocessing
(<tf.Tensor: shape=(32, 32, 3), dtype=float32, numpy=
array([[[ 59.,  62.,  63.],
        [ 43.,  46.,  45.],
        [ 50.,  48.,  43.],
        ...,
        [158., 132., 108.],
        [152., 125., 102.],
        [148., 124., 103.]],

       [[ 16.,  20.,  20.],
        [  0.,   0.,   0.],
        [ 18.,   8.,   0.],
        ...,
        [123.,  88.,  55.],
        [119.,  83.,  50.],
        [122.,  87.,  57.]],

       [[ 25.,  24.,  21.],
        [ 16.,   7.,   0.],
        [ 49.,  27.,   8.],
        ...,
        [118.,  84.,  50.],
        [120.,  84.,  50.],
        [109.,  73.,  42.]],

       ...,

       [[208., 170.,  96.],
        [201., 153.,  34.],
        [198., 161.,  26.],
        ...,
        [160., 133.,  70.],
        [ 56.,  31.,   7.],
        [ 53.,  34.,  20.]],

       [[180., 139.,  96.],
        [173., 123.,  42.],
        [186., 144.,  30.],
        ...,
        [184., 148.,  94.],
        [ 97.,  62.,  34.],
        [ 83.,  53.,  34.]]

2022-03-08 12:06:47.184534: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 614400000 exceeds 10% of free system memory.


(<tf.Tensor: shape=(32, 32, 3), dtype=float32, numpy=
array([[[0.23137255, 0.24313726, 0.24705882],
        [0.16862746, 0.18039216, 0.1764706 ],
        [0.19607843, 0.1882353 , 0.16862746],
        ...,
        [0.61960787, 0.5176471 , 0.42352942],
        [0.59607846, 0.49019608, 0.4       ],
        [0.5803922 , 0.4862745 , 0.40392157]],

       [[0.0627451 , 0.07843138, 0.07843138],
        [0.        , 0.        , 0.        ],
        [0.07058824, 0.03137255, 0.        ],
        ...,
        [0.48235294, 0.34509805, 0.21568628],
        [0.46666667, 0.3254902 , 0.19607843],
        [0.47843137, 0.34117648, 0.22352941]],

       [[0.09803922, 0.09411765, 0.08235294],
        [0.0627451 , 0.02745098, 0.        ],
        [0.19215687, 0.10588235, 0.03137255],
        ...,
        [0.4627451 , 0.32941177, 0.19607843],
        [0.47058824, 0.32941177, 0.19607843],
        [0.42745098, 0.28627452, 0.16470589]],

       ...,

       [[0.8156863 , 0.6666667 , 0.3764706 ],
        [0.788

2022-03-08 12:06:47.614202: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2022-03-08 12:06:47.615015: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2199995000 Hz


## Upstream preprocessing data with tf.data.Dataset generator

### Tabular data

- Upstream: The data is preprocessed upstream from the model while the data is fed for training.

    - Define preprocessing function:
        - Input: unprocessed batch of tensors
        - Output: preprocessed batch of tensors
    - Use tf.data.Dataset `map()` method to map the preprocessing function to the generator output.

In this example:

- Create tf.data.Dataset generator for Boston Housing data.
- Iterate a single batch before preprocessing.
- Define preprocessing function to scale all the features between 0 and 1.
- Map the preprocessing function to the dataset.
- Iterate once through the dataset.

In [25]:
from tensorflow.keras.datasets import boston_housing

(x_train, y_train), (x_test, y_test) = boston_housing.load_data()

tf_dataset = tf.data.Dataset.from_tensor_slices((x_train.astype(np.float32), y_train))

print("Before preprocessing")
for batch in tf_dataset:
    print(batch)
    break


def preprocessing_fn(inputs, labels):
    inputs = tft.scale_to_0_1(inputs)
    return tf.cast(inputs, tf.float32), labels


tf_dataset = tf_dataset.map(preprocessing_fn)

print("After preprocessing")
for batch in tf_dataset:
    print(batch)
    break

Before preprocessing
(<tf.Tensor: shape=(13,), dtype=float32, numpy=
array([  1.23247,   0.     ,   8.14   ,   0.     ,   0.538  ,   6.142  ,
        91.7    ,   3.9769 ,   4.     , 307.     ,  21.     , 396.9    ,
        18.72   ], dtype=float32)>, <tf.Tensor: shape=(), dtype=float64, numpy=15.2>)
After preprocessing
(<tf.Tensor: shape=(13,), dtype=float32, numpy=
array([0.7742506 , 0.5       , 0.9997084 , 0.5       , 0.63134706,
       0.997854  , 1.        , 0.98160124, 0.98201376, 1.        ,
       1.        , 1.        , 1.        ], dtype=float32)>, <tf.Tensor: shape=(), dtype=float64, numpy=15.2>)
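
The values printed after preprocessing match a sigmoid of the inputs (for example, 0 maps to 0.5 and 1.23247 maps to about 0.774) rather than a per-feature min-max scaling. This is consistent with `tft.scale_to_0_1` being an analyzer-based transform that expects its minimum and maximum to be computed in a full pass inside a TensorFlow Transform pipeline, which does not happen when it is called directly inside `map()`. The following is a minimal sketch of the alternative, assuming you compute the statistics yourself before mapping. It uses plain Python and hypothetical toy rows (not the Boston Housing data), and `column_min_max` and `scale_row` are illustrative names.

```python
import math

# Hypothetical toy rows standing in for a tabular dataset.
rows = [
    [1.0, 0.0, 300.0],
    [3.0, 0.0, 400.0],
    [2.0, 1.0, 350.0],
]


def column_min_max(data):
    """One full pass over the data to get per-feature (min, max) pairs."""
    return [(min(col), max(col)) for col in zip(*data)]


def scale_row(row, stats):
    """Min-max scale one row using precomputed statistics."""
    scaled = []
    for value, (lo, hi) in zip(row, stats):
        # Guard against constant features to avoid division by zero.
        scaled.append(0.0 if hi == lo else (value - lo) / (hi - lo))
    return scaled


stats = column_min_max(rows)
scaled_rows = [scale_row(row, stats) for row in rows]
print(scaled_rows)


def sigmoid(x):
    """Logistic function, shown only to compare with the output above."""
    return 1.0 / (1.0 + math.exp(-x))


print(round(sigmoid(1.23247), 4), sigmoid(0.0))
```

In a `tf.data` pipeline, the same idea would be to compute the per-feature minimum and maximum from the training array once, then apply the arithmetic inside the mapped function.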


## Offline preprocessing with Dataflow

- Generate data schema from BigQuery table.
- Define Beam pipeline to:
    - Split data from BigQuery table into train and eval datasets.
    - Encode datasets as TFRecords, using the data schema.
    - Save the TFRecords as compressed files to Cloud Storage.
- Run the pipeline.

### Read the BigQuery dataset into a pandas dataframe

Next, you read a sample of the dataset into a pandas dataframe using the BigQuery `list_rows()` and `to_dataframe()` methods, as follows:

- `list_rows()`: Lists the rows of the specified table and returns a row iterator. Optionally specify:
    - `selected_fields`: Subset of fields (columns) to return.
    - `max_results`: The maximum number of rows to return, similar to the SQL `LIMIT` clause.

- `rows.to_dataframe()`: Iterates over the rows and reads the data into a pandas dataframe.

Learn more about [Loading BigQuery table into a dataframe](https://cloud.google.com/bigquery/docs/bigquery-storage-python-pandas).

In [26]:
# Download a table.
table = bigquery.TableReference.from_string("bigquery-public-data.samples.gsod")

rows = bqclient.list_rows(
    table,
    max_results=500,
    selected_fields=[
        bigquery.SchemaField("station_number", "STRING"),
        bigquery.SchemaField("year", "INTEGER"),
        bigquery.SchemaField("month", "INTEGER"),
        bigquery.SchemaField("day", "INTEGER"),
        bigquery.SchemaField("mean_temp", "FLOAT"),
    ],
)

dataframe = rows.to_dataframe()
print(dataframe.head())

  station_number  year  month  day  mean_temp
0          39730  1929     10   20  52.799999
1          33110  1929     12   18  47.500000
2          37770  1931      4   24  50.200001
3         726810  1931      6   23  65.099998
4         726810  1931      3    2  42.799999


###  Generate dataset statistics

#### Dataframe input data

Generate statistics on the dataset with the TensorFlow Data Validation (TFDV) package. Use the `generate_statistics_from_dataframe()` method, with the following parameters:

- `dataframe`: The dataset in an in-memory pandas dataframe.
- `stats_options`: The selected statistics options:
  - `label_feature`: The column which is the label to predict.
  - `sample_rate`: The sampling rate. If specified, statistics are computed over the sample.
  - `num_top_values`: number of most frequent feature values to keep for string features.

Learn about [TensorFlow Data Validation (TFDV)](https://www.tensorflow.org/tfx/data_validation/get_started).

In [27]:
stats = tfdv.generate_statistics_from_dataframe(
    dataframe=dataframe,
    stats_options=tfdv.StatsOptions(
        label_feature="mean_temp", sample_rate=1, num_top_values=50
    ),
)

print(stats)

datasets {
  num_examples: 500
  features {
    type: STRING
    string_stats {
      common_stats {
        num_non_missing: 500
        min_num_values: 1
        max_num_values: 1
        avg_num_values: 1.0
        num_values_histogram {
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 50.0
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 50.0
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 50.0
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 50.0
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 50.0
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 50.0
          }
          buckets {
            low_value: 1.0
    

###  Generate the raw data schema

Generate the data schema on the dataset with the TensorFlow Data Validation (TFDV) package. Use the `infer_schema()` method, with the following parameters:

- `statistics`: The statistics generated by TFDV.

In [28]:
schema = tfdv.infer_schema(statistics=stats)
print(schema)

feature {
  name: "station_number"
  type: BYTES
  int_domain {
  }
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "year"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "month"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "day"
  type: INT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "mean_temp"
  type: FLOAT
  presence {
    min_fraction: 1.0
    min_count: 1
  }
  shape {
    dim {
      size: 1
    }
  }
}



#### Save schema for the dataset to Cloud Storage

Next, you write the schema for the dataset to the dataset's Cloud Storage bucket.

In [29]:
SCHEMA_LOCATION = BUCKET_NAME + "/schema.txt"

# When running Apache Beam directly (file is directly accessed)
tfdv.write_schema_text(output_path=SCHEMA_LOCATION, schema=schema)
# When running with Dataflow (file is uploaded to worker pool)
tfdv.write_schema_text(output_path="schema.txt", schema=schema)

#### Prepare package requirements for Dataflow job

Before you can run a Dataflow job, you need to specify the package requirements for the worker pool that will execute the job.

In [30]:
%%writefile setup.py
import setuptools

REQUIRED_PACKAGES = [
    "google-cloud-aiplatform==1.4.2",
    "tensorflow-transform==1.2.0",
    "tensorflow-data-validation==1.2.0",
]

setuptools.setup(
    name="executor",
    version="0.0.1",
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    include_package_data=True,
    package_data={"./": ["schema.txt"]}
)

Overwriting setup.py


### Preprocess data with Dataflow

#### Dataset splitting

Next, you preprocess the data using Dataflow. In this example, you query the BigQuery table and split the examples into training and evaluation datasets. For expediency, the number of examples from the dataset is limited to 500.
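
The split is deterministic: each row is reduced to a bucket number, and the bucket decides which partition the row goes to. The sketch below isolates the `split_dataset` partition function from the pipeline cell and runs it over synthetic rows without Beam. The rows are hypothetical stand-ins shaped like the parsed BigQuery records, not real GSOD data.

```python
import json
from collections import Counter


def split_dataset(row, num_partitions, ratio):
    """Return a partition index for a row, as in the pipeline's partition function."""
    assert num_partitions == len(ratio)
    # Deterministic pseudo-hash: sum of the character codes of the JSON text.
    bucket = sum(map(ord, json.dumps(row))) % sum(ratio)
    total = 0
    for i, part in enumerate(ratio):
        total += part
        if bucket < total:
            return i
    return len(ratio) - 1


# Hypothetical rows: each value is wrapped in a list, like parse_bq_record does.
rows = [
    {
        "station_number": [str(30000 + i)],
        "year": [1929 + i % 10],
        "month": [1 + i % 12],
        "day": [1 + i % 28],
        "mean_temp": [40.0 + i % 30],
    }
    for i in range(1000)
]

# Partition 0 is the training split, partition 1 the evaluation split.
counts = Counter(split_dataset(row, 2, ratio=[8, 2]) for row in rows)
print(counts)
```

Because the bucket is a sum of character codes rather than a well-mixed hash, the resulting proportions only approximate the requested 8:2 ratio. The benefit is that the same row always lands in the same partition across runs.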

In [31]:
import os

import tensorflow_transform.beam as tft_beam

RUNNER = "DataflowRunner"  # DirectRunner for local running w/o Dataflow


def parse_bq_record(bq_record):
    """Parses a bq_record to a dictionary."""
    output = {}
    for key in bq_record:
        output[key] = [bq_record[key]]
    return output


def split_dataset(bq_row, num_partitions, ratio):
    """Returns a partition number for a given bq_row."""
    import json

    assert num_partitions == len(ratio)
    bucket = sum(map(ord, json.dumps(bq_row))) % sum(ratio)
    total = 0
    for i, part in enumerate(ratio):
        total += part
        if bucket < total:
            return i
    return len(ratio) - 1


def run_pipeline(args):
    """Runs a Beam pipeline to split the dataset"""

    pipeline_options = beam.pipeline.PipelineOptions(flags=[], **args)

    raw_data_query = args["raw_data_query"]
    exported_data_prefix = args["exported_data_prefix"]
    temp_location = args["temp_location"]
    project = args["project"]

    schema = tfdv.load_schema_text(SCHEMA_LOCATION)

    with beam.Pipeline(options=pipeline_options) as pipeline:
        with tft_beam.Context(temp_location):

            # Read raw BigQuery data.
            raw_train_data, raw_eval_data = (
                pipeline
                | "Read Raw Data"
                >> beam.io.ReadFromBigQuery(
                    query=raw_data_query,
                    project=project,
                    use_standard_sql=True,
                )
                | "Parse Data" >> beam.Map(parse_bq_record)
                | "Split" >> beam.Partition(split_dataset, 2, ratio=[8, 2])
            )

            _ = (
                raw_train_data
                | "Write Raw Train Data"
                >> beam.io.tfrecordio.WriteToTFRecord(
                    file_path_prefix=os.path.join(exported_data_prefix, "train/"),
                    file_name_suffix=".gz",
                    coder=tft.coders.ExampleProtoCoder(schema),
                )
            )

            _ = (
                raw_eval_data
                | "Write Raw Eval Data"
                >> beam.io.tfrecordio.WriteToTFRecord(
                    file_path_prefix=os.path.join(exported_data_prefix, "eval/"),
                    file_name_suffix=".gz",
                    coder=tft.coders.ExampleProtoCoder(schema),
                )
            )


EXPORTED_DATA_PREFIX = os.path.join(BUCKET_NAME, "exported_data")

QUERY_STRING = "SELECT {},{} FROM {} LIMIT 500".format(
    "CAST(station_number as STRING) AS station_number,year,month,day",
    "mean_temp",
    IMPORT_FILE[5:],
)
JOB_NAME = "gsod" + TIMESTAMP

args = {
    "runner": RUNNER,
    "raw_data_query": QUERY_STRING,
    "exported_data_prefix": EXPORTED_DATA_PREFIX,
    "temp_location": os.path.join(BUCKET_NAME, "temp"),
    "project": PROJECT_ID,
    "region": REGION,
    "setup_file": "./setup.py",
}

print("Data preprocessing started...")
run_pipeline(args)
print("Data preprocessing completed.")

! gsutil ls $EXPORTED_DATA_PREFIX/train
! gsutil ls $EXPORTED_DATA_PREFIX/eval

Data preprocessing started...


  temp_location = pcoll.pipeline.options.view_as(


INFO:apache_beam.runners.portability.stager:Executing command: ['/home/jupyter/testing bucket/myenv/bin/python', 'setup.py', 'sdist', '--dist-dir', '/tmp/tmpnlcbw83i']
INFO:apache_beam.runners.portability.stager:Downloading source distribution of the SDK from PyPi
INFO:apache_beam.runners.portability.stager:Executing command: ['/home/jupyter/testing bucket/myenv/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/tmpnlcbw83i', 'apache-beam==2.37.0', '--no-deps', '--no-binary', ':all:']







INFO:apache_beam.runners.portability.stager:Staging SDK sources from PyPI: dataflow_python_sdk.tar
INFO:apache_beam.runners.portability.stager:Downloading binary distribution of the SDK from PyPi
INFO:apache_beam.runners.portability.stager:Executing command: ['/home/jupyter/testing bucket/myenv/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/tmpnlcbw83i', 'apache-beam==2.37.0', '--no-deps', '--only-binary', ':all:', '--python-version', '37', '--implementation', 'cp', '--abi', 'cp37m', '--platform', 'manylinux1_x86_64']


You should consider upgrading via the '/home/jupyter/testing bucket/myenv/bin/python -m pip install --upgrade pip' command.


INFO:apache_beam.runners.portability.stager:Staging binary distribution of the SDK from PyPI: apache_beam-2.37.0-cp37-cp37m-manylinux1_x86_64.whl
INFO:root:Default Python SDK image for environment is apache/beam_python3.7_sdk:2.37.0
INFO:root:Using provided Python SDK container image: gcr.io/cloud-dataflow/v1beta3/python37:2.37.0
INFO:root:Python SDK container image set to "gcr.io/cloud-dataflow/v1beta3/python37:2.37.0" for Docker environment



INFO:apache_beam.runners.dataflow.internal.apiclient:Defaulting to the temp_location as staging_location: gs://vertex-ai-devaip-20220308120540/temp
INFO:apache_beam.internal.gcp.auth:Setting socket default timeout to 60 seconds.
INFO:apache_beam.internal.gcp.auth:socket default timeout is 60.0 seconds.
INFO:oauth2client.transport:Attempting refresh to obtain initial access_token
INFO:apache_beam.io.gcp.bigquery_tools:Started BigQuery job: <JobReference
 location: 'US'
 projectId: 'vertex-ai-dev'>
 bq show -j --format=prettyjson --project_id=vertex-ai-dev None
INFO:apache_beam.runners.dataflow.internal.apiclient:Starting GCS upload to gs://vertex-ai-devaip-20220308120540/temp/beamapp-jupyter-0308120657-326426-8j271gsl.1646741217.327548/workflow.tar.gz...
INFO:apache_beam.runners.dataflow.internal.apiclient:Completed GCS upload to gs://vertex-ai-devaip-20220308120540/temp/beamapp-jupyter-0308120657-326426-8j271gsl.1646741217.327548/workflow.tar.gz in 0 seconds.
INFO:apache_beam.runners.d

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

- Cloud Storage Bucket

In [41]:
if "BUCKET_NAME" in globals():
    ! gsutil rm -r $BUCKET_NAME

Removing gs://vertex-ai-devaip-20220308120540/schema.txt#1646741209649207...
Removing gs://vertex-ai-devaip-20220308120540/exported_data/eval/-00000-of-00001.gz#1646741709567696...
Removing gs://vertex-ai-devaip-20220308120540/exported_data/train/-00000-of-00001.gz#1646741711927311...
Removing gs://vertex-ai-devaip-20220308120540/temp/beamapp-jupyter-0308120657-326426-8j271gsl.1646741217.327548/apache_beam-2.37.0-cp37-cp37m-manylinux1_x86_64.whl#1646741219271257...
/ [4 objects]                                                                   
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m rm ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Removing gs://vertex-ai-devaip-20220308120540/temp/beamapp-jupyter-0308120657-326426-8j271gsl.1646741217.327548/dataflow_python_sdk.tar#1646741218292952...
Removing gs://vertex-ai-devaip-2022030812