# ML with TensorFlow Extended (TFX) -- Part 2
The purpose of this tutorial is to show how to do end-to-end ML with TFX libraries on Google Cloud Platform. This tutorial covers:
1. Data analysis and schema generation with **TF Data Validation**.
2. Data preprocessing with **TF Transform**.
3. Model training with **TF Estimator**.
4. Model evaluation with **TF Model Analysis**.

This notebook has been tested in Jupyter on the Deep Learning VM.

## 0. Setup Python and Cloud environment

Apache Beam support for Python 3 is in alpha at the moment, so we'll do this notebook in Python 2.

In [None]:
!python -m pip install --upgrade grpcio_tools tensorflow_data_validation

In [None]:
#%%bash
# install from source to get latest bug fixes in.
#git clone https://github.com/apache/beam
#cd beam/sdks/python
#python3 setup.py sdist

In [None]:
#%pip install -q --upgrade './beam/sdks/python/dist/apache-beam-2.13.0.dev0.tar.gz[gcp]'

In [13]:
import apache_beam as beam
import platform
import tensorflow as tf
import tensorflow_data_validation as tfdv
import tensorflow_transform as tft
import tornado

print('tornado version: {}'.format(tornado.version))
print('Python version: {}'.format(platform.python_version()))
print('TF version: {}'.format(tf.__version__))
print('TFT version: {}'.format(tft.__version__))
print('TFDV version: {}'.format(tfdv.__version__))
print('Apache Beam version: {}'.format(beam.__version__))

tornado version: 5.1.1
Python version: 2.7.13
TF version: 1.13.1
TFT version: 0.13.0
TFDV version: 0.13.1
Apache Beam version: 2.11.0


In [14]:
PROJECT = 'cloud-training-demos'    # Replace with your PROJECT
BUCKET = 'cloud-training-demos-ml'  # Replace with your BUCKET
REGION = 'us-central1'              # Choose an available region for Cloud MLE

import os

os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION

## ensure we're using python2 env
os.environ['CLOUDSDK_PYTHON'] = 'python2'

In [15]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

## ensure we predict locally with our current Python environment
gcloud config set ml_engine/local_python `which python`

Updated property [core/project].
Updated property [compute/region].
Updated property [ml_engine/local_python].


<img valign="middle" src="images/tfx.jpeg">

### Flights dataset

We'll use the flights dataset from the book [Data Science on Google Cloud Platform](http://shop.oreilly.com/product/0636920057628.do).

In [17]:
DATA_BUCKET = "gs://cloud-training-demos/flights/chapter8/output/"
TRAIN_DATA_PATTERN = DATA_BUCKET + "train*"
EVAL_DATA_PATTERN = DATA_BUCKET + "test*"

In [18]:
CSV_COLUMNS = ('ontime,dep_delay,taxiout,distance,avg_dep_delay,avg_arr_delay' + 
               ',carrier,dep_lat,dep_lon,arr_lat,arr_lon,origin,dest').split(',')
TARGET_FEATURE_NAME = 'ontime'
DEFAULTS     = [[0.0],[0.0],[0.0],[0.0],[0.0],[0.0],\
                ['na'],[0.0],[0.0],[0.0],[0.0],['na'],['na']]
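
As an illustration of how the column names and per-column defaults fit together, here is a minimal sketch that parses one CSV line into a dictionary using only the standard library. The `parse_line` helper and the sample row are hypothetical and are not part of the notebook's pipeline, which decodes rows with `tft.coders.CsvCoder` instead.

```python
import csv

CSV_COLUMNS = ('ontime,dep_delay,taxiout,distance,avg_dep_delay,avg_arr_delay'
               ',carrier,dep_lat,dep_lon,arr_lat,arr_lon,origin,dest').split(',')
DEFAULTS = [[0.0], [0.0], [0.0], [0.0], [0.0], [0.0],
            ['na'], [0.0], [0.0], [0.0], [0.0], ['na'], ['na']]


def parse_line(line):
  """Turn one CSV line into a {column: value} dict, filling blanks with defaults."""
  values = next(csv.reader([line]))
  row = {}
  for name, raw, default in zip(CSV_COLUMNS, values, DEFAULTS):
    default_value = default[0]
    if raw == '':
      row[name] = default_value   # missing field: fall back to the default
    elif isinstance(default_value, float):
      row[name] = float(raw)      # numeric column
    else:
      row[name] = raw             # string column
  return row


# Hypothetical row; avg_arr_delay (the sixth field) is left blank on purpose.
sample = '1.0,-3.0,12.0,370.0,4.5,,AA,32.89,-97.03,29.98,-95.34,DFW,IAH'
row = parse_line(sample)
print('{} {} {}'.format(row['ontime'], row['avg_arr_delay'], row['carrier']))
```

The type of each default (float or string) doubles as the column's type here, which is why `DEFAULTS` has one entry per name in `CSV_COLUMNS`.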

## 2. Data Preprocessing
For data preprocessing and transformation, we use [TensorFlow Transform](https://www.tensorflow.org/tfx/guide/tft) to perform the following:
1. Implement transformation logic in **preprocessing_fn**.
2. **Analyze and transform** training data.
3. **Transform** evaluation data.
4. Save transformed **data**, transform **schema**, and transform **logic**.
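
The distinction between "analyze and transform" and "transform" can be sketched in plain Python, without TensorFlow Transform. The `analyze` and `transform` functions and the toy distance values are illustrative assumptions; the sketch uses the population standard deviation, which may differ in detail from what the library computes.

```python
import math


def analyze(values):
  """Full pass over the training data: compute the statistics needed later."""
  n = len(values)
  mean = sum(values) / float(n)
  variance = sum((v - mean) ** 2 for v in values) / float(n)
  return {'mean': mean, 'std': math.sqrt(variance)}


def transform(values, stats):
  """Apply previously computed statistics to any dataset, row by row."""
  return [(v - stats['mean']) / stats['std'] for v in values]


train_distance = [100.0, 200.0, 300.0]  # toy values
eval_distance = [200.0, 400.0]          # toy values

stats = analyze(train_distance)                   # analyze: training data only
train_scaled = transform(train_distance, stats)   # transform training data
eval_scaled = transform(eval_distance, stats)     # reuse the training statistics
```

The evaluation data is never analyzed: it is scaled with the mean and standard deviation learned from the training data, so both datasets see the same transformation.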

### 2.1 Implement preprocessing_fn

In [19]:
def make_preprocessing_fn(raw_schema):

  def preprocessing_fn(input_features):

    processed_features = {}

    for feature in raw_schema.feature:
      feature_name = feature.name
      
      if feature_name in [TARGET_FEATURE_NAME]:
        processed_features[feature_name] = input_features[feature_name]
      elif feature.type == 1:  # 1 is the BYTES (string) FeatureType in the schema proto.
        # Extract vocabulary and integerize categorical features.
        processed_features[feature_name+"_integerized"] = (
            tft.compute_and_apply_vocabulary(input_features[feature_name], vocab_filename=feature_name))
      else:
        # Normalize numeric features to zero mean and unit variance.
        processed_features[feature_name+"_scaled"] = tft.scale_to_z_score(input_features[feature_name])

    # Bucketize some of the numeric features using quantiles.
    quantiles = tft.quantiles(input_features["distance"], num_buckets=5, epsilon=0.01)
    processed_features["distance_bucketized"] = tft.apply_buckets(
      input_features["distance"], bucket_boundaries=quantiles)

    return processed_features

  return preprocessing_fn
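
To make the categorical and bucketing steps concrete, the following is a rough plain-Python sketch of what vocabulary integerization and bucketization do to individual values. The helper names, the tie-breaking rule, the toy carriers, and the boundary values are all assumptions made for illustration; the actual library may order the vocabulary and pick quantile boundaries differently.

```python
import bisect
from collections import Counter


def build_vocabulary(values):
  """Order distinct values by descending frequency (ties broken alphabetically)."""
  counts = Counter(values)
  ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
  return [value for value, _ in ordered]


def integerize(value, vocabulary, default_value=-1):
  """Map a string to its vocabulary index, or default_value if it was never seen."""
  try:
    return vocabulary.index(value)
  except ValueError:
    return default_value


def bucketize(value, boundaries):
  """Return the index of the bucket that value falls into."""
  return bisect.bisect_right(boundaries, value)


carriers = ['AA', 'DL', 'AA', 'UA', 'AA', 'DL']  # toy training values
vocabulary = build_vocabulary(carriers)
carrier_ids = [integerize(c, vocabulary) for c in carriers + ['ZZ']]

boundaries = [250.0, 500.0, 1000.0, 2000.0]  # 4 boundaries give 5 buckets
distance_buckets = [bucketize(d, boundaries) for d in [100.0, 500.0, 2500.0]]
```

An unseen carrier such as `'ZZ'` maps to -1 in this sketch, which is consistent with the `min: -1` that appears in the `int_domain` of `carrier_integerized` in the transformed schema.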

### 2.2 Implement the Beam pipeline

In [20]:
def run_pipeline(args):
  import tensorflow_transform as tft
  import tensorflow_transform.beam as tft_beam
  import tensorflow_data_validation as tfdv
  from tensorflow_transform.tf_metadata import dataset_metadata
  from tensorflow_transform.tf_metadata import dataset_schema
  from tensorflow_transform.tf_metadata import schema_utils
    
  pipeline_options = beam.pipeline.PipelineOptions(flags=[], **args)
    
  raw_schema_location = args['raw_schema_location']
  raw_train_data_location = args['raw_train_data_location']
  raw_eval_data_location = args['raw_eval_data_location']
  transformed_train_data_location = args['transformed_train_data_location']
  transformed_eval_data_location = args['transformed_eval_data_location']
  transform_artifact_location = args['transform_artifact_location']
  temporary_dir = args['temporary_dir']
  runner = args['runner']
    
  print ("Raw schema location: {}".format(raw_schema_location))
  print ("Raw train data location: {}".format(raw_train_data_location))
  print ("Raw evaluation data location: {}".format(raw_eval_data_location))
  print ("Transformed train data location: {}".format(transformed_train_data_location))
  print ("Transformed evaluation data location: {}".format(transformed_eval_data_location))
  print ("Transform artifact location: {}".format(transform_artifact_location))
  print ("Temporary directory: {}".format(temporary_dir))
  print ("Runner: {}".format(runner))
  print ("")

  # Load TFDV schema and create tft schema from it.
  source_raw_schema = tfdv.load_schema_text(raw_schema_location)
  raw_feature_spec = schema_utils.schema_as_feature_spec(source_raw_schema).feature_spec
  raw_metadata = dataset_metadata.DatasetMetadata(
    dataset_schema.from_feature_spec(raw_feature_spec))

  with beam.Pipeline(runner, options=pipeline_options) as pipeline:
    with tft_beam.Context(temporary_dir):
      
      converter = tft.coders.CsvCoder(column_names=CSV_COLUMNS, 
        schema=raw_metadata.schema)

      ###### analyze & transform training data ###############################

      # Read raw training csv data.
      step = 'Train'
      print ("Reading and parsing raw training data...")
      raw_train_data = (
        pipeline
          | '{} - Read Raw Data'.format(step) >> beam.io.textio.ReadFromText(raw_train_data_location)
          | '{} - Remove Empty Rows'.format(step) >> beam.Filter(lambda line: line)
          | '{} - Decode CSV Data'.format(step) >> beam.Map(converter.decode)
        )
      
      # Create a train dataset from the data and schema.
      raw_train_dataset = (raw_train_data, raw_metadata)

      # Analyze and transform raw_train_dataset to produce transformed_train_dataset and transform_fn.
      print ("Analyzing and transforming raw training data...")
      transformed_train_dataset, transform_fn = (
        raw_train_dataset 
        | '{} - Analyze & Transform'.format(step) >> tft_beam.AnalyzeAndTransformDataset(
              make_preprocessing_fn(source_raw_schema))
      )
  
      # Get data and schema separately from the transformed_train_dataset.
      transformed_train_data, transformed_metadata = transformed_train_dataset

      # write transformed train data to sink.
      print ("Writing transformed training data...")
      _ = (
        transformed_train_data 
          | '{} - Write Transformed Data'.format(step) >> beam.io.tfrecordio.WriteToTFRecord(
            file_path_prefix=transformed_train_data_location,
            file_name_suffix=".tfrecords",
            coder=tft.coders.ExampleProtoCoder(transformed_metadata.schema))
        )

      ###### transform evaluation data #########################################

      # Read raw evaluation csv data.
      step = 'Eval'
      print ("Reading and parsing raw evaluation data...")
      raw_eval_data = (
        pipeline
          | '{} - Read Raw Data'.format(step) >> beam.io.textio.ReadFromText(raw_eval_data_location)
          | '{} - Remove Empty Rows'.format(step) >> beam.Filter(lambda line: line)
          | '{} - Decode CSV Data'.format(step) >> beam.Map(converter.decode)
        )
      
      # Create an eval dataset from the data and schema.
      raw_eval_dataset = (raw_eval_data, raw_metadata)

      # Transform eval data based on produced transform_fn.
      print ("Transforming raw evaluation data...")
      transformed_eval_dataset = (
        (raw_eval_dataset, transform_fn) 
          | '{} - Transform'.format(step) >> tft_beam.TransformDataset()
      )

      # Get data from the transformed_eval_dataset.
      transformed_eval_data, _ = transformed_eval_dataset

      # Write transformed eval data to sink.
      print ("Writing transformed evaluation data...")
      _ = (
          transformed_eval_data 
          | '{} - Write Transformed Data'.format(step) >> beam.io.tfrecordio.WriteToTFRecord(
              file_path_prefix=transformed_eval_data_location,
              file_name_suffix=".tfrecords",
              coder=tft.coders.ExampleProtoCoder(transformed_metadata.schema))
        )

      ###### write transformation metadata #######################################################

      # Write transform_fn.
      print ("Writing transform artifacts...")
      _ = (
          transform_fn 
          | 'Write Transform Artifacts' >> tft_beam.WriteTransformFn(
              transform_artifact_location)
      )

          

### 2.3 Run data transformation pipeline

In [None]:
!python -m pip freeze | grep tensorflow

In [21]:
%%writefile setup.py
from setuptools import setup, find_packages

setup(name='tfxdemo',
      version='1.0',
      packages=find_packages(),
      install_requires=['tensorflow-transform==0.13.0', 
                        'tensorflow-data-validation==0.13.1'],
)

Overwriting setup.py


In [22]:
#runner = 'DirectRunner'; OUTPUT_DIR = 'output/flights/tfx'   # on-prem
#runner = 'DirectRunner'; OUTPUT_DIR = 'gs://{}/flights/tfx'.format(BUCKET)  # hybrid
runner = 'DataflowRunner'; OUTPUT_DIR = 'gs://{}/flights/tfx'.format(BUCKET)  # on GCP

RAW_SCHEMA_LOCATION = 'raw_schema.pbtxt'
TRANSFORM_ARTIFACTS_DIR = os.path.join(OUTPUT_DIR,'transform')
TRANSFORMED_DATA_DIR = os.path.join(OUTPUT_DIR,'transformed')
TEMP_DIR = os.path.join(OUTPUT_DIR, 'tmp')

args = {
    
    'runner': runner,

    'raw_schema_location': RAW_SCHEMA_LOCATION,

    'raw_train_data_location': TRAIN_DATA_PATTERN,
    'raw_eval_data_location': EVAL_DATA_PATTERN,

    'transformed_train_data_location':  os.path.join(TRANSFORMED_DATA_DIR, "train"),
    'transformed_eval_data_location':  os.path.join(TRANSFORMED_DATA_DIR, "eval"),
    'transform_artifact_location':  TRANSFORM_ARTIFACTS_DIR,
    
    'temporary_dir': TEMP_DIR,
    'project': PROJECT,
    'temp_location': TEMP_DIR,
    'staging_location': os.path.join(OUTPUT_DIR, 'staging'),
    'max_num_workers': 8,
    'save_main_session': False,
    'setup_file': './setup.py'
}
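
Because `run_pipeline` reads several entries with `args[...]`, a missing key only surfaces as a `KeyError` once the function is running. A small check before launching can catch this earlier. The sketch below is one possible approach; `REQUIRED_KEYS`, `missing_keys`, and the bucket name are hypothetical and not part of the notebook.

```python
import posixpath

# Keys that run_pipeline reads directly with args[...].
REQUIRED_KEYS = [
    'runner', 'raw_schema_location',
    'raw_train_data_location', 'raw_eval_data_location',
    'transformed_train_data_location', 'transformed_eval_data_location',
    'transform_artifact_location', 'temporary_dir',
]


def missing_keys(args):
  """Return the required keys that are absent from args, in a stable order."""
  return [key for key in REQUIRED_KEYS if key not in args]


output_dir = 'gs://my-bucket/flights/tfx'  # hypothetical bucket
incomplete_args = {
    'runner': 'DirectRunner',
    'raw_schema_location': 'raw_schema.pbtxt',
    'transform_artifact_location': posixpath.join(output_dir, 'transform'),
    'temporary_dir': posixpath.join(output_dir, 'tmp'),
}
print(missing_keys(incomplete_args))
```

The sketch uses `posixpath.join` so that `gs://` paths always get forward slashes; `os.path.join` behaves the same way on Linux and macOS, which is where this notebook was tested.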

In [None]:
if tf.gfile.Exists(OUTPUT_DIR):
  print("Removing {} contents...".format(OUTPUT_DIR))
  tf.gfile.DeleteRecursively(OUTPUT_DIR)

tf.logging.set_verbosity(tf.logging.ERROR)
print("Running TF Transform pipeline...")
print("")
run_pipeline(args)
print("")
print("Pipeline is done.")

### Check the outputs

In [23]:
!gcloud storage ls $OUTPUT_DIR/*

gs://cloud-training-demos-ml/flights/tfx/

gs://cloud-training-demos-ml/flights/tfx/staging/:
gs://cloud-training-demos-ml/flights/tfx/staging/beamapp-jupyter-0402043210-224816.1554179530.224958/

gs://cloud-training-demos-ml/flights/tfx/tmp/:
gs://cloud-training-demos-ml/flights/tfx/tmp/
gs://cloud-training-demos-ml/flights/tfx/tmp/beamapp-jupyter-0402043210-224816.1554179530.224958/
gs://cloud-training-demos-ml/flights/tfx/tmp/tftransform_tmp/

gs://cloud-training-demos-ml/flights/tfx/transform/:
gs://cloud-training-demos-ml/flights/tfx/transform/
gs://cloud-training-demos-ml/flights/tfx/transform/transform_fn/
gs://cloud-training-demos-ml/flights/tfx/transform/transformed_metadata/

gs://cloud-training-demos-ml/flights/tfx/transformed/:
gs://cloud-training-demos-ml/flights/tfx/transformed/eval-00000-of-00008.tfrecords
gs://cloud-training-demos-ml/flights/tfx/transformed/eval-00001-of-00008.tfrecords
gs://cloud-training-demos-ml/flights/tfx/transformed/eval-00002-of-00008.tfrecords
g
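
The shard file names in the listing follow the pattern `<prefix>-SSSSS-of-NNNNN<suffix>`. As a small sketch, the helper below reproduces that naming and shows how a glob-style pattern could select one dataset's shards. The `shard_names` helper is hypothetical, and the shard count used for the training files is an assumption since only the eval shards are visible above.

```python
import fnmatch


def shard_names(prefix, num_shards, suffix='.tfrecords'):
  """Build names such as eval-00000-of-00008.tfrecords for each shard."""
  return ['{}-{:05d}-of-{:05d}{}'.format(prefix, i, num_shards, suffix)
          for i in range(num_shards)]


eval_files = shard_names('eval', 8)
train_files = shard_names('train', 8)  # shard count for train is an assumption
all_files = eval_files + train_files

# Select only the evaluation shards with a glob-style pattern.
eval_matches = fnmatch.filter(all_files, 'eval-*.tfrecords')
```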

In [24]:
!gcloud storage ls $OUTPUT_DIR/transform/transform_fn

gs://cloud-training-demos-ml/flights/tfx/transform/transform_fn/
gs://cloud-training-demos-ml/flights/tfx/transform/transform_fn/saved_model.pb
gs://cloud-training-demos-ml/flights/tfx/transform/transform_fn/assets/
gs://cloud-training-demos-ml/flights/tfx/transform/transform_fn/variables/


In [25]:
!gcloud storage cat $OUTPUT_DIR/transform/transformed_metadata/schema.pbtxt

feature {
  name: "arr_lat_scaled"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "arr_lon_scaled"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "avg_arr_delay_scaled"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "avg_dep_delay_scaled"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "carrier_integerized"
  type: INT
  int_domain {
    min: -1
    max: 13
    is_categorical: true
  }
  presence {
    min_fraction: 1.0
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "dep_delay_scaled"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
    dim {
      size: 1
    }
  }
}
feature {
  name: "dep_lat_scaled"
  type: FLOAT
  presence {
    min_fraction: 1.0
  }
  shape {
    dim {
   

## License

Copyright 2019 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

---
This is not an official Google product. The sample code is provided for educational purposes only.
---