# Exploring tf.transform # 

**Learning Objectives**
1. Preprocess data and engineer new features using tf.transform.
1. Create and deploy an Apache Beam pipeline.
1. Use the processed data to train a taxifare model locally, then serve a prediction.

## Introduction 
While Pandas is fine for experimenting, it is better to do preprocessing in Apache Beam when you operationalize your workflow. Beam also helps if you need to preprocess data in flight, since it supports streaming. In this lab we will pull data from BigQuery, then use Apache Beam and tf.transform to process the data.

Only specific combinations of TensorFlow/Beam are supported by tf.transform so make sure to get a combo that works. In this lab we will be using: 
* TFT 0.24.0
* TF 2.3.0 
* Apache Beam [GCP] 2.24.0

Each learning objective will correspond to a __#TODO__ in the [student lab notebook](../labs/5_tftransform_taxifare.ipynb) -- try to complete that notebook first before reviewing this solution notebook.

In [None]:
# Run the chown command to change the ownership
!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst

In [1]:
# Install the necessary dependencies
!pip install tensorflow==2.3.0 tensorflow-transform==0.24.0 apache-beam[gcp]==2.24.0

Collecting tensorflow-transform==0.24.0
Downloading tensorflow_transform-0.24.0-py3-none-any.whl (373 kB)
 |████████████████████████████████| 373 kB 4.4 MB/s eta 0:00:01
Collecting apache-beam[gcp]==2.24.0
  Downloading apache_beam-2.24.0-cp37-cp37m-manylinux2010_x86_64.whl (8.5 MB)
     |████████████████████████████████| 8.5 MB 8.6 MB/s eta 0:00:01
Collecting scipy==1.4.1
  Downloading scipy-1.4.1-cp37-cp37m-manylinux1_x86_64.whl (26.1 MB)
     |████████████████████████████████| 26.1 MB 65.8 MB/s eta 0:00:01
...
...
...
Successfully built oauth2client
Installing collected packages: httplib2, oauth2client, mock, pyarrow, cachetools, apache-beam, tensorflow-metadata, tfx-bsl, tensorflow-transform, scipy
  Attempting uninstall: httplib2
Found existing installation: httplib2 0.18.1
Uninstalling httplib2-0.18.1:
  Successfully uninstalled httplib2-0.18.1
Attempting uninstall: oauth2client
Found existing installation: oauth2client 4.1.3
Uninstalling oauth2client-4.1.3:
  Successfully uninst

**NOTE**: You may ignore specific incompatibility errors and warnings. These components and issues do not impact your ability to complete the lab.

Download the .whl file for tensorflow-transform. We will pass this file to the Beam pipeline options so that it is installed on the Dataflow workers.

In [None]:
!pip install --user google-cloud-bigquery==1.25.0

In [2]:
!pip download tensorflow-transform==0.24.0 --no-deps

Collecting tensorflow-transform==0.24.0
Using cached tensorflow_transform-0.24.0-py3-none-any.whl (373 kB)
Saved ./tensorflow_transform-0.24.0-py3-none-any.whl
Successfully downloaded tensorflow-transform


<b>Restart the kernel</b> (click on the reload button above).

In [1]:
%%bash
# Output installed packages in requirements format.
pip freeze | grep -e 'flow\|beam'

apache-beam==2.24.0
tensorflow @ file:///opt/conda/conda-bld/dlenv-tf-2-3-cpu_1599847782078/work/tensorflow-2.3.0-cp37-cp37m-linux_x86_64.whl
tensorflow-data-validation==0.23.
tensorflow-datasets==3.2.1
tensorflow-enterprise-addons @ file:///opt/conda/conda-bld/dlenv-tf-2-3-cpu_1599847782078/work/tensorflow_enterprise_addons-0.0.0-py3-none-any.whl
tensorflow-estimator==2.3.0
tensorflow-hub==0.9.0
tensorflow-io==0.15.0
tensorflow-metadata==0.24.0
tensorflow-model-analysis==0.23.0
tensorflow-probability==0.11.0
tensorflow-serving-api==2.3.0
tensorflow-transform==0.24.0


In [2]:
# Import data processing libraries
import tensorflow as tf
import tensorflow_transform as tft
# The shutil module provides high-level file operations, such as removing directory trees.
import shutil
# Show the currently installed version of TensorFlow
print(tf.__version__)

2.3.0


In [3]:
# change these to try this notebook out
BUCKET = 'cloud-example-labs'
PROJECT = 'project-id'
REGION = 'us-central1'

In [4]:
# The OS module in python provides functions for interacting with the operating system.
import os
os.environ['BUCKET'] = BUCKET
os.environ['PROJECT'] = PROJECT
os.environ['REGION'] = REGION

In [5]:
%%bash
# gcloud config set - set a Cloud SDK property
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


In [6]:
%%bash
# Create bucket
if ! gcloud storage ls | grep -q gs://${BUCKET}/; then
  gcloud storage buckets create --location=${REGION} gs://${BUCKET}
fi

Creating gs://qwiklabs-gcp-xxxxxxxx/...


## Input source: BigQuery

Get data from BigQuery, but defer most of the filtering to Beam.
Note that the `dayofweek` column now contains strings.

In [7]:
# Import Google BigQuery API client library
from google.cloud import bigquery


def create_query(phase, EVERY_N):
    """Creates a query with the proper splits.

    Args:
        phase: int, 1=train, 2=valid.
        EVERY_N: int, take an example EVERY_N rows.

    Returns:
        Query string with the proper splits.
    """
    base_query = """
    WITH daynames AS
    (SELECT ['Sun', 'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat'] AS daysofweek)
    SELECT
    (tolls_amount + fare_amount) AS fare_amount,
    daysofweek[ORDINAL(EXTRACT(DAYOFWEEK FROM pickup_datetime))] AS dayofweek,
    EXTRACT(HOUR FROM pickup_datetime) AS hourofday,
    pickup_longitude AS pickuplon,
    pickup_latitude AS pickuplat,
    dropoff_longitude AS dropofflon,
    dropoff_latitude AS dropofflat,
    passenger_count AS passengers,
    'notneeded' AS key
    FROM
    `nyc-tlc.yellow.trips`, daynames
    WHERE
    trip_distance > 0 AND fare_amount > 0
    """
    if EVERY_N is None:
        if phase < 2:
            # training
            query = """{0} AND ABS(MOD(FARM_FINGERPRINT(CAST(
            pickup_datetime AS STRING)), 4)) < 2""".format(base_query)
        else:
            query = """{0} AND ABS(MOD(FARM_FINGERPRINT(CAST(
            pickup_datetime AS STRING)), 4)) = {1}""".format(base_query, phase)
    else:
        query = """{0} AND ABS(MOD(FARM_FINGERPRINT(CAST(
        pickup_datetime AS STRING)), {1})) = {2}""".format(
            base_query, EVERY_N, phase)

    return query

query = create_query(2, 100000)
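
The query above samples and splits rows by hashing `pickup_datetime` and taking the result modulo `EVERY_N`, so a given row always lands in the same split. The following is a minimal pure-Python sketch of that idea. It swaps in SHA-256 from `hashlib` because `FARM_FINGERPRINT` is a BigQuery function, so the bucket numbers will not match BigQuery's. The helper names `hash_bucket` and `assign_split` are hypothetical and used only for illustration.

```python
import hashlib


def hash_bucket(key, num_buckets):
    """Map a string key to a stable bucket in [0, num_buckets)."""
    # SHA-256 stands in for BigQuery's FARM_FINGERPRINT (assumption for this sketch).
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def assign_split(key, every_n):
    """Mirror create_query: bucket 1 is train, bucket 2 is valid."""
    bucket = hash_bucket(key, every_n)
    if bucket == 1:
        return "train"
    if bucket == 2:
        return "valid"
    return None  # the row is not sampled at all


# Made-up timestamps, used only to show that the assignment is repeatable.
timestamps = ["2012-07-05 14:18:00", "2014-03-01 08:02:11", "2010-11-23 23:59:59"]
for ts in timestamps:
    print(ts, hash_bucket(ts, 4), assign_split(ts, 4))
```

Because the bucket depends only on the key, rerunning the query never moves a row from the training split to the validation split.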


Let's pull this query down into a Pandas DataFrame and take a look at some of the statistics.

In [8]:
df_valid = bigquery.Client().query(query).to_dataframe()
# `head()` returns the first n rows of the DataFrame (5 by default)
display(df_valid.head())
# `describe()` returns a statistical summary of the numeric columns
df_valid.describe()

| | fare_amount | dayofweek | hourofday | pickuplon | pickuplat | dropofflon | dropofflat | passengers | key |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.5 | Sat | 0 | -74.000292 | 40.728722 | -73.995235 | 40.724961 | 1 | notneeded |
| 1 | 6.9 | Sun | 0 | -73.986003 | 40.722688 | -74.004549 | 40.718822 | 1 | notneeded |
| 2 | 16.5 | Sun | 0 | -74.002155 | 40.740375 | -73.967537 | 40.792845 | 5 | notneeded |
| 3 | 143.0 | Sun | 0 | -73.990255 | 40.740407 | -74.350245 | 40.663847 | 1 | notneeded |
| 4 | 19.0 | Sun | 0 | -73.977255 | 40.75493 | -73.91757 | 40.767272 | 1 | notneeded |


| | fare_amount | hourofday | pickuplon | pickuplat | dropofflon | dropofflat | passengers |
|---|---|---|---|---|---|---|---|
| count | 11181.0 | 11181.0 | 11181.0 | 11181.0 | 11181.0 | 11181.0 | 11181.0 |
| mean | 11.242599 | 13.244075 | -72.576852 | 39.973146 | -72.748974 | 40.006091 | 1.722118 |
| std | 9.447462 | 6.548354 | 10.133452 | 5.777329 | 12.981577 | 5.664887 | 1.351062 |
| min | 2.5 | 0.0 | -78.133333 | -73.991278 | -751.4 | -73.97797 | 0.0 |
| 25% | 6.0 | 9.0 | -73.991849 | 40.734954 | -73.991236 | 40.734008 | 1.0 |
| 50% | 8.5 | 14.0 | -73.981824 | 40.75264 | -73.980164 | 40.753427 | 1.0 |
| 75% | 12.5 | 19.0 | -73.967418 | 40.7667 | -73.964153 | 40.767832 | 2.0 |
| max | 143.0 | 23.0 | 40.806487 | 41.366138 | 40.7854 | 41.366138 | 6.0 |
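
The summary contains values that cannot be New York City trips, such as a minimum `dropofflon` of -751.4 and a minimum `passengers` of 0. This is the kind of row that the Beam filter later in this notebook removes. The sketch below applies the same bounds as that notebook's `is_valid` function to plain dictionaries. The helper names `in_range` and `looks_like_nyc_trip` are hypothetical, and `bad_row` is a constructed example that reuses the -751.4 minimum from the summary.

```python
# Bounds copied from the is_valid() filter defined later in this notebook.
LON_RANGE = (-78.0, -70.0)
LAT_RANGE = (37.0, 45.0)


def in_range(value, bounds):
    """Return True if value lies strictly inside (low, high)."""
    low, high = bounds
    return low < value < high


def looks_like_nyc_trip(row):
    """Apply the fare, coordinate, and passenger checks to one row."""
    return (
        row["fare_amount"] >= 2.5
        and in_range(row["pickuplon"], LON_RANGE)
        and in_range(row["dropofflon"], LON_RANGE)
        and in_range(row["pickuplat"], LAT_RANGE)
        and in_range(row["dropofflat"], LAT_RANGE)
        and row["passengers"] > 0
    )


# First row of the head() output above.
good_row = {
    "fare_amount": 4.5, "pickuplon": -74.000292, "pickuplat": 40.728722,
    "dropofflon": -73.995235, "dropofflat": 40.724961, "passengers": 1,
}
# Constructed row: same trip, but with the out-of-range longitude from describe().
bad_row = dict(good_row, dropofflon=-751.4)

print(looks_like_nyc_trip(good_row))  # True
print(looks_like_nyc_trip(bad_row))   # False
```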


## Create ML dataset using tf.transform and Dataflow

Let's use Cloud Dataflow to read in the BigQuery data and write it out as TFRecord files. Along the way, let's use tf.transform to do scaling and transforming. Using tf.transform allows us to save the metadata to ensure that the appropriate transformations get carried out during prediction as well.

`transformed_data` is a Beam `PCollection`.
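
To make the "save the metadata" point concrete, here is a minimal pure-Python sketch of the analyze-then-transform pattern. It is not the tf.transform implementation. The functions `analyze` and `transform` are hypothetical stand-ins. The vocabulary uses first-seen order, and the `-1` out-of-vocabulary id and the `0.5` fallback for a constant column are choices made for this sketch. What it shows is that statistics are computed once on the training data and then reused unchanged for evaluation and serving rows.

```python
def analyze(rows, numeric_cols, vocab_cols):
    """Full pass over the training rows: collect min/max and vocabularies."""
    stats = {"minmax": {}, "vocab": {}}
    for col in numeric_cols:
        values = [row[col] for row in rows]
        stats["minmax"][col] = (min(values), max(values))
    for col in vocab_cols:
        # Give each distinct string an integer id in first-seen order.
        seen = {}
        for row in rows:
            seen.setdefault(row[col], len(seen))
        stats["vocab"][col] = seen
    return stats


def transform(row, stats):
    """Apply the saved statistics to one row (train, eval, or serving)."""
    out = {}
    for col, (low, high) in stats["minmax"].items():
        span = high - low
        # Fallback of 0.5 for a constant column is a choice made for this sketch.
        out[col] = (row[col] - low) / span if span else 0.5
    for col, vocab in stats["vocab"].items():
        out[col] = vocab.get(row[col], -1)  # -1 marks an unseen string
    return out


train = [
    {"pickuplat": 40.0, "dayofweek": "Sun"},
    {"pickuplat": 41.0, "dayofweek": "Mon"},
]
stats = analyze(train, ["pickuplat"], ["dayofweek"])
print(transform({"pickuplat": 40.5, "dayofweek": "Mon"}, stats))
# {'pickuplat': 0.5, 'dayofweek': 1}
```

In the pipeline that follows, `AnalyzeAndTransformDataset` plays the role of `analyze` plus `transform` on the training data, and `TransformDataset` reuses the resulting transform function on the evaluation data.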

In [9]:
# Import a module named `datetime` to work with dates as date objects.
import datetime
# Import data processing libraries and modules
import tensorflow as tf
import apache_beam as beam
import tensorflow_transform as tft
import tensorflow_metadata as tfmd
from tensorflow_transform.beam import impl as beam_impl


def is_valid(inputs):
    """Check to make sure the inputs are valid.

    Args:
        inputs: dict, dictionary of TableRow data from BigQuery.

    Returns:
        True if the inputs are valid and False if they are not.
    """
    try:
        pickup_longitude = inputs['pickuplon']
        dropoff_longitude = inputs['dropofflon']
        pickup_latitude = inputs['pickuplat']
        dropoff_latitude = inputs['dropofflat']
        hourofday = inputs['hourofday']
        dayofweek = inputs['dayofweek']
        passenger_count = inputs['passengers']
        fare_amount = inputs['fare_amount']
        return fare_amount >= 2.5 and pickup_longitude > -78 \
            and pickup_longitude < -70 and dropoff_longitude > -78 \
            and dropoff_longitude < -70 and pickup_latitude > 37 \
            and pickup_latitude < 45 and dropoff_latitude > 37 \
            and dropoff_latitude < 45 and passenger_count > 0
    except Exception:
        # Missing keys or None values (which fail the comparisons) mark the row invalid
        return False


def preprocess_tft(inputs):
    """Preprocess the features and add engineered features with tf transform.

    Args:
        inputs: dict, dictionary of TableRow data from BigQuery.

    Returns:
        Dictionary of preprocessed data after scaling and feature engineering.
    """
    print(inputs)
    result = {}
    result['fare_amount'] = tf.identity(inputs['fare_amount'])
    # build a vocabulary
    # TODO 1
    result['dayofweek'] = tft.string_to_int(inputs['dayofweek'])
    result['hourofday'] = tf.identity(inputs['hourofday'])  # pass through
    # scaling numeric values
    # TODO 2
    result['pickuplon'] = (tft.scale_to_0_1(inputs['pickuplon']))
    result['pickuplat'] = (tft.scale_to_0_1(inputs['pickuplat']))
    result['dropofflon'] = (tft.scale_to_0_1(inputs['dropofflon']))
    result['dropofflat'] = (tft.scale_to_0_1(inputs['dropofflat']))
    result['passengers'] = tf.cast(inputs['passengers'], tf.float32)  # a cast
    # arbitrary TF func
    result['key'] = tf.as_string(tf.ones_like(inputs['passengers']))
    # engineered features
    latdiff = inputs['pickuplat'] - inputs['dropofflat']
    londiff = inputs['pickuplon'] - inputs['dropofflon']
    # Scale our engineered features latdiff and londiff between 0 and 1
    # TODO 3
    result['latdiff'] = tft.scale_to_0_1(latdiff)
    result['londiff'] = tft.scale_to_0_1(londiff)
    dist = tf.sqrt(latdiff * latdiff + londiff * londiff)
    result['euclidean'] = tft.scale_to_0_1(dist)
    return result


def preprocess(in_test_mode):
    """Sets up preprocess pipeline.

    Args:
        in_test_mode: bool, False to launch a Dataflow job, True to run locally.
    """
    import os
    import os.path
    import tempfile
    from apache_beam.io import tfrecordio
    from tensorflow_transform.coders import example_proto_coder
    from tensorflow_transform.tf_metadata import dataset_metadata
    from tensorflow_transform.tf_metadata import dataset_schema
    from tensorflow_transform.beam import tft_beam_io
    from tensorflow_transform.beam.tft_beam_io import transform_fn_io

    job_name = 'preprocess-taxi-features' + '-'
    job_name += datetime.datetime.now().strftime('%y%m%d-%H%M%S')
    if in_test_mode:
        import shutil
        print('Launching local job ... hang on')
        OUTPUT_DIR = './preproc_tft'
        shutil.rmtree(OUTPUT_DIR, ignore_errors=True)
        EVERY_N = 100000
    else:
        print('Launching Dataflow job {} ... hang on'.format(job_name))
        OUTPUT_DIR = 'gs://{0}/taxifare/preproc_tft/'.format(BUCKET)
        import subprocess
        subprocess.call('gcloud storage rm --recursive {}'.format(OUTPUT_DIR).split())
        EVERY_N = 10000

    options = {
        'staging_location': os.path.join(OUTPUT_DIR, 'tmp', 'staging'),
        'temp_location': os.path.join(OUTPUT_DIR, 'tmp'),
        'job_name': job_name,
        'project': PROJECT,
        'num_workers': 1,
        'max_num_workers': 1,
        'teardown_policy': 'TEARDOWN_ALWAYS',
        'no_save_main_session': True,
        'direct_num_workers': 1,
        'extra_packages': ['tensorflow_transform-0.24.0-py3-none-any.whl']
        }

    opts = beam.pipeline.PipelineOptions(flags=[], **options)
    if in_test_mode:
        RUNNER = 'DirectRunner'
    else:
        RUNNER = 'DataflowRunner'

    # Set up raw data metadata
    raw_data_schema = {
        colname: dataset_schema.ColumnSchema(
            tf.string, [], dataset_schema.FixedColumnRepresentation())
        for colname in 'dayofweek,key'.split(',')
    }

    raw_data_schema.update({
        colname: dataset_schema.ColumnSchema(
            tf.float32, [], dataset_schema.FixedColumnRepresentation())
        for colname in
        'fare_amount,pickuplon,pickuplat,dropofflon,dropofflat'.split(',')
    })

    raw_data_schema.update({
        colname: dataset_schema.ColumnSchema(
            tf.int64, [], dataset_schema.FixedColumnRepresentation())
        for colname in 'hourofday,passengers'.split(',')
    })

    raw_data_metadata = dataset_metadata.DatasetMetadata(
        dataset_schema.Schema(raw_data_schema))

    # Run Beam
    with beam.Pipeline(RUNNER, options=opts) as p:
        with beam_impl.Context(temp_dir=os.path.join(OUTPUT_DIR, 'tmp')):
            # Save the raw data metadata
            (raw_data_metadata |
                'WriteInputMetadata' >> tft_beam_io.WriteMetadata(
                    os.path.join(
                        OUTPUT_DIR, 'metadata/rawdata_metadata'), pipeline=p))

            # Read training data from bigquery and filter rows
            raw_data = (p | 'train_read' >> beam.io.Read(
                    beam.io.BigQuerySource(
                        query=create_query(1, EVERY_N),
                        use_standard_sql=True)) |
                        'train_filter' >> beam.Filter(is_valid))

            raw_dataset = (raw_data, raw_data_metadata)

            # Analyze and transform training data
            # TODO 4
            transformed_dataset, transform_fn = (
                raw_dataset | beam_impl.AnalyzeAndTransformDataset(
                    preprocess_tft))
            transformed_data, transformed_metadata = transformed_dataset

            # Save transformed train data to disk in efficient tfrecord format
            transformed_data | 'WriteTrainData' >> tfrecordio.WriteToTFRecord(
                os.path.join(OUTPUT_DIR, 'train'), file_name_suffix='.gz',
                coder=example_proto_coder.ExampleProtoCoder(
                    transformed_metadata.schema))

            # Read eval data from bigquery and filter rows
            # TODO 5
            raw_test_data = (p | 'eval_read' >> beam.io.Read(
                beam.io.BigQuerySource(
                    query=create_query(2, EVERY_N),
                    use_standard_sql=True)) | 'eval_filter' >> beam.Filter(
                        is_valid))

            raw_test_dataset = (raw_test_data, raw_data_metadata)

            # Transform eval data
            transformed_test_dataset = (
                (raw_test_dataset, transform_fn) | beam_impl.TransformDataset()
                )
            transformed_test_data, _ = transformed_test_dataset

            # Save transformed eval data to disk in efficient tfrecord format
            (transformed_test_data |
                'WriteTestData' >> tfrecordio.WriteToTFRecord(
                    os.path.join(OUTPUT_DIR, 'eval'), file_name_suffix='.gz',
                    coder=example_proto_coder.ExampleProtoCoder(
                        transformed_metadata.schema)))

            # Save transformation function to disk for use at serving time
            (transform_fn |
                'WriteTransformFn' >> transform_fn_io.WriteTransformFn(
                    os.path.join(OUTPUT_DIR, 'metadata')))

# Change to True to run locally
preprocess(in_test_mode=False)


Launching Dataflow job preprocess-taxi-features-200924-103136 ... hang on
{'dayofweek': <tf.Tensor 'inputs/inputs/dayofweek/Placeholder_copy:0' shape=(None,) dtype=string>, 'dropofflat': <tf.Tensor 'inputs/inputs/dropofflat/Placeholder_copy:0' shape=(None,) dtype=float32>, 'dropofflon': <tf.Tensor 'inputs/inputs/dropofflon/Placeholder_copy:0' shape=(None,) dtype=float32>, 'fare_amount': <tf.Tensor 'inputs/inputs/F_fare_amount/Placeholder_copy:0' shape=(None,) dtype=float32>, 'hourofday': <tf.Tensor 'inputs/inputs/hourofday/Placeholder_copy:0' shape=(None,)
dtype=int64>, 'key': <tf.Tensor 'inputs/inputs/key/Placeholder_copy:0' shape=(None,) dtype=string>, 'passengers': <tf.Tensor 'inputs/inputs/passengers/Placeholder_copy:0' shape=(None,) dtype=int64>, 'pickuplat': <tf.Tensor 'inputs/inputs/pickuplat/Placeholder_copy:0' shape=(None,) dtype=float32>, 'pickuplon': <tf.Tensor 'inputs/inputs/pickuplon/Placeholder_copy:0' shape=(None,) dtype=float32>}
INFO:tensorflow:Assets added to graph.
I

This will take __10-15 minutes__. You cannot continue with this lab until your Dataflow job has completed successfully.

**Note**: The above command may fail with the error **`Workflow failed. Causes: There was a problem refreshing your credentials`**. In that case, re-run the cell.

In [10]:
%%bash
# List the contents of the preprocessing output directory
gcloud storage ls gs://${BUCKET}/taxifare/preproc_tft/

gs://cloud-example-labs/taxifare/preproc_tft/
gs://cloud-example-labs/taxifare/preproc_tft/eval-00000-of-00001.gz
gs://cloud-example-labs/taxifare/preproc_tft/train-00000-of-00004.gz
gs://cloud-example-labs/taxifare/preproc_tft/train-00001-of-00004.gz
gs://cloud-example-labs/taxifare/preproc_tft/train-00002-of-00004.gz
gs://cloud-example-labs/taxifare/preproc_tft/train-00003-of-00004.gz
gs://cloud-example-labs/taxifare/preproc_tft/metadata/
gs://cloud-example-labs/taxifare/preproc_tft/tmp/


## Train off preprocessed data ##
Now that our data is ready and we have verified that it is in the correct location, we can train our taxifare model locally.

In [12]:
%%bash
# Train our taxifare model locally
rm -r ./taxi_trained
export PYTHONPATH=${PYTHONPATH}:$PWD
python3 -m tft_trainer.task \
    --train_data_path="gs://${BUCKET}/taxifare/preproc_tft/train*" \
    --eval_data_path="gs://${BUCKET}/taxifare/preproc_tft/eval*" \
    --output_dir=./taxi_trained

rm: cannot remove './taxi_trained': No such file or directory
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_log_step_count_steps': 100, '_device_fn': None, '_service': None, '_model_dir': './taxi_trained', '_experimental_distribute': None, '_protocol': None, '_is_chief': True, '_task_id': 0, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f15433f00f0>, '_task_type': 'worker', '_evaluation_master': '', '_master': '', '_keep_checkpoint_every_n_hours': 10000, '_tf_random_seed': None, '_keep_checkpoint_max': 5, '_save_checkpoints_secs': 600, '_save_checkpoints_steps': None, '_num_ps_replicas': 0, '_train_distribute': None, '_save_summary_steps': 100, '_eval_distribute': None, '_experimental_max_worker_delay_secs': None, '_num_worker_replicas': 1, '_global_id_in_cluster': 0, '_session_creation_timeout_secs': 72

In [13]:
# List the exported model directory (one timestamped folder per export)
!ls $PWD/taxi_trained/export/exporter

1600944292


Now let's create fake data in JSON format and use it to serve a prediction with `gcloud ai-platform local predict`.

In [19]:
%%writefile /tmp/test.json
{"dayofweek":0, "hourofday":17, "pickuplon": -73.885262, "pickuplat": 40.773008, "dropofflon": -73.987232, "dropofflat": 40.732403, "passengers": 2.0}

Overwriting /tmp/test.json
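
The `--json-instances` file is newline-delimited: each line is one JSON object describing one instance. If you would rather build the file programmatically than with `%%writefile`, a minimal sketch using only the standard library could look like this. It writes to a temporary directory (an assumption made here so that `/tmp/test.json` is not overwritten) and reuses the same field values as the cell above.

```python
import json
import os
import tempfile

# Same fake trip as in the %%writefile cell above.
instance = {
    "dayofweek": 0,
    "hourofday": 17,
    "pickuplon": -73.885262,
    "pickuplat": 40.773008,
    "dropofflon": -73.987232,
    "dropofflat": 40.732403,
    "passengers": 2.0,
}

# Write one JSON object per line; add more lines to request more predictions.
path = os.path.join(tempfile.mkdtemp(), "test.json")
with open(path, "w") as f:
    f.write(json.dumps(instance) + "\n")

# Read the file back to confirm that it holds exactly one parseable instance.
with open(path) as f:
    lines = f.read().splitlines()
print(len(lines))                       # 1
print(json.loads(lines[0]) == instance)  # True
```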


In [20]:
%%bash
sudo find "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/command_lib/ml_engine" -name '*.pyc' -delete

In [21]:
%%bash
# Serve a prediction with gcloud ai-platform local predict
model_dir=$(ls $PWD/taxi_trained/export/exporter/)
gcloud ai-platform local predict \
    --model-dir=./taxi_trained/export/exporter/${model_dir} \
    --json-instances=/tmp/test.json

PREDICTIONS
[8.322036743164062]


If the signature defined in the model is not serving_default then you must specify it via --signature-name flag, otherwise the command may fail.
Instructions for updating:
non-resource variables are not supported in the long term
2019-12-17 21:50:07.411300: I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations:  AVX2 FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
2019-12-17 21:50:07.419152: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2200000000 Hz
2019-12-17 21:50:07.419442: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x560ff33e0950 executing computations on platform Host. Devices:
2019-12-17 21:50:07.419471: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2019-12-17 21:50:07.419807: I tensorflow/core/comm

Copyright 2022 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.