##### Copyright &copy; 2020 The TensorFlow Authors.

This is a modified version of the [census](https://github.com/tensorflow/tfx/blob/master/docs/tutorials/transform/census.ipynb) example, combined with the [sentiment_example](https://github.com/tensorflow/transform/blob/master/examples/sentiment_example.py), that trains a text classifier on the IMDB dataset.

I made this to experiment with other data sources and other network architectures. This is the first step: text classification based on CSV input data.

Plans to extend:
1. Replace the linear classifier `tf.estimator.LinearClassifier` with an RNN. I can't see any prebuilt RNN estimators listed at https://www.tensorflow.org/guide/estimator, so a custom estimator will probably have to be built.
2. Extend the example to a multi-class text classifier, probably using the 20 Newsgroups dataset.


In [17]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Preprocessing data with TensorFlow Transform
***The Feature Engineering Component of TensorFlow Extended (TFX)***

This example colab notebook provides a somewhat more advanced example of how <a target='_blank' href='https://www.tensorflow.org/tfx/transform/'>TensorFlow Transform</a> (`tf.Transform`) can be used to preprocess data using exactly the same code for both training a model and serving inferences in production.

TensorFlow Transform is a library for preprocessing input data for TensorFlow, including creating features that require a full pass over the training dataset.  For example, using TensorFlow Transform you could:

* Normalize an input value by using the mean and standard deviation
* Convert strings to integers by generating a vocabulary over all of the input values
* Convert floats to integers by assigning them to buckets, based on the observed data distribution

TensorFlow has built-in support for manipulations on a single example or a batch of examples. `tf.Transform` extends these capabilities to support full passes over the entire training dataset.

The output of `tf.Transform` is exported as a TensorFlow graph which you can use for both training and serving. Using the same graph for both training and serving can prevent skew, since the same transformations are applied in both stages.
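
As a rough illustration of the analyze-then-transform idea described above, here is a minimal sketch in plain Python. It does not use the `tf.Transform` API; the function names `analyze` and `transform` and the sample values are made up for this example. The point is only that the full pass produces constants once, and the per-example code then reuses those frozen constants at both training and serving time.

```python
import statistics


def analyze(values):
    """Full pass over the training data: compute the constants once."""
    return {"mean": statistics.mean(values), "std": statistics.pstdev(values)}


def transform(value, constants):
    """Per-example step that only reads the frozen constants."""
    return (value - constants["mean"]) / constants["std"]


# Hypothetical training column.
train_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Runs once, at training time.
constants = analyze(train_values)

# The same transform code is used for training data and for a serving request.
train_scaled = [transform(v, constants) for v in train_values]
serving_scaled = transform(6.0, constants)
```

Because `transform` never recomputes the mean or standard deviation, a value seen at serving time is scaled exactly as it would have been during training, which is the property that prevents skew.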

Key Point: In order to understand `tf.Transform` and how it works with Apache Beam, you'll need to know a little bit about Apache Beam itself.  The <a target='_blank' href='https://beam.apache.org/documentation/programming-guide/'>Beam Programming Guide</a> is a great place to start.

## Python check, imports, and globals
First we'll make sure that we're using Python 3, and then go ahead and install and import the stuff we need.

In [2]:
import sys

# Confirm that we're using Python 3
assert sys.version_info.major == 3, 'Oops, not running Python 3. Use Runtime > Change runtime type'

In [3]:
# Paths to the train and test data.
# These should be split before running this notebook. Here the same file is
# used for both, so the reported test accuracy cannot be trusted.
train = 'movie_data.csv'
test = 'movie_data.csv'
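
Since the data should be split beforehand, one possible way to do that with only the standard library is sketched below. The helper `split_csv`, its parameters, and the output file names in the usage comment are hypothetical and not part of the original notebook. It assumes the CSV has a single header row and that each record fits on one line.

```python
import csv
import random


def split_csv(source_path, train_path, test_path, test_fraction=0.2, seed=0):
    """Write a shuffled train/test split of a CSV file, repeating the header."""
    with open(source_path, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    splits = {test_path: rows[:n_test], train_path: rows[n_test:]}

    for path, subset in splits.items():
        with open(path, 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(subset)

    # Number of train rows and test rows, excluding the header.
    return len(rows) - n_test, n_test


# Example usage (file names are placeholders):
# n_train, n_test = split_csv('movie_data.csv', 'movie_train.csv', 'movie_test.csv')
```

The two returned counts are the values you would then use for `NUM_TRAIN_INSTANCES` and `NUM_TEST_INSTANCES`.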

In [4]:
import os
import pprint
import tempfile  # used by transform_data() for the tf.Transform temp directory

import tensorflow as tf
print('TF: {}'.format(tf.__version__))

print('Installing Apache Beam')
!pip install -Uq apache_beam==2.20.0
import apache_beam as beam
print('Beam: {}'.format(beam.__version__))

print('Installing TensorFlow Transform')
!pip install -q tensorflow-transform==0.22.0
import tensorflow_transform as tft
print('Transform: {}'.format(tft.__version__))

import tensorflow_transform.beam as tft_beam


TF: 2.1.0
Installing Apache Beam
Beam: 2.20.0
Installing TensorFlow Transform
Transform: 0.22.0


### Define our features and schema
Let's define a schema based on what types the columns are in our input.  Among other things this will help with importing them correctly.

In [5]:
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import schema_utils

VOCAB_SIZE = 20000

REVIEW_KEY = 'text'
REVIEW_WEIGHT_KEY = 'review_weight'
LABEL_KEY = 'label'

RAW_DATA_FEATURE_SPEC = {
    REVIEW_KEY: tf.io.FixedLenFeature([], tf.string),
    LABEL_KEY: tf.io.FixedLenFeature([], tf.int64)
}

RAW_DATA_METADATA = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec(RAW_DATA_FEATURE_SPEC))

DELIMITERS = '.,!?() '

### Setting hyperparameters and basic housekeeping
Constants and hyperparameters used for training.  The instance counts below are carried over from the census example; set them to the number of rows in your own train and test files.

Note: The number of instances will be computed by `tf.Transform` in future versions, in which case it can be read from the metadata.

In [6]:
testing = os.getenv("WEB_TEST_BROWSER", False)
if testing:
  TRAIN_NUM_EPOCHS = 1
  NUM_TRAIN_INSTANCES = 1
  TRAIN_BATCH_SIZE = 1
  NUM_TEST_INSTANCES = 1
else:
  TRAIN_NUM_EPOCHS = 16
  NUM_TRAIN_INSTANCES = 32561
  TRAIN_BATCH_SIZE = 128
  NUM_TEST_INSTANCES = 16281

# Names of temp files
TRANSFORMED_TRAIN_DATA_FILEBASE = 'train_transformed'
TRANSFORMED_TEST_DATA_FILEBASE = 'test_transformed'
EXPORTED_MODEL_DIR = 'exported_model_dir'

## Preprocessing with `tf.Transform`

### Create a `tf.Transform` preprocessing_fn
The _preprocessing function_ is the most important concept of tf.Transform. A preprocessing function is where the transformation of the dataset really happens. It accepts and returns a dictionary of tensors, where a tensor means a [`Tensor`](https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/Tensor) or [`SparseTensor`](https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/SparseTensor). There are two main groups of API calls that typically form the heart of a preprocessing function:

1. **TensorFlow Ops:** Any function that accepts and returns tensors, which usually means TensorFlow ops. These add TensorFlow operations to the graph that transforms raw data into transformed data one feature vector at a time.  These will run for every example, during both training and serving.
2. **TensorFlow Transform Analyzers:** Any of the analyzers provided by tf.Transform. Analyzers also accept and return tensors, but unlike TensorFlow ops they only run once, during training, and typically make a full pass over the entire training dataset. They create [tensor constants](https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/constant), which are added to your graph. For example, `tft.min` computes the minimum of a tensor over the training dataset. tf.Transform provides a fixed set of analyzers, but this will be extended in future versions.

Caution: When you apply your preprocessing function to serving inferences, the constants that were created by analyzers during training do not change.  If your data has trend or seasonality components, plan accordingly.

In [8]:
#https://github.com/tensorflow/transform/blob/3375730685445d1f8b3b6a696c851087bb6b6441/examples/sentiment_example.py

def preprocessing_fn(inputs):
    """Preprocess input columns into transformed columns."""
    review = inputs[REVIEW_KEY]

    # Here tf.compat.v1.string_split behaves differently from
    # tf.strings.split.
    review_tokens = tf.compat.v1.string_split(review, DELIMITERS)
    review_indices = tft.compute_and_apply_vocabulary(
        review_tokens, top_k=VOCAB_SIZE)
    # Add one for the oov bucket created by compute_and_apply_vocabulary.
    review_bow_indices, review_weight = tft.tfidf(review_indices,
                                                  VOCAB_SIZE + 1)
    return {
        REVIEW_KEY: review_bow_indices,
        REVIEW_WEIGHT_KEY: review_weight,
        LABEL_KEY: inputs[LABEL_KEY]
    }
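
To make the steps in `preprocessing_fn` more concrete, here is a plain-Python sketch that imitates them on three toy reviews. It is an approximation for illustration, not the `tf.Transform` implementation. The names `tokenize`, `tfidf`, and `TOP_K` are made up. The smoothed IDF formula `log((N + 1) / (df + 1)) + 1` is my reading of the `tft.tfidf` documentation and should be checked against the version you use. Folding the out-of-vocabulary id `-1` into the last bucket with a modulo follows the comment in `get_feature_columns`.

```python
import math
import re
from collections import Counter

DELIMITERS = '.,!?() '
TOP_K = 2  # toy stand-in for VOCAB_SIZE


def tokenize(text):
    """Split on any delimiter character and drop empty tokens."""
    return [t for t in re.split('[' + re.escape(DELIMITERS) + ']', text) if t]


def tfidf(doc_ids):
    """Return (sorted unique ids, weights) per document, using smoothed IDF."""
    n_docs = len(doc_ids)
    doc_freq = Counter(i for doc in doc_ids for i in set(doc))
    result = []
    for doc in doc_ids:
        term_counts = Counter(doc)
        indices = sorted(term_counts)
        weights = [
            term_counts[i] / len(doc)
            * (math.log((n_docs + 1) / (doc_freq[i] + 1)) + 1)
            for i in indices
        ]
        result.append((indices, weights))
    return result


docs = ['good film', 'bad film!', 'good, slow film']
tokenized = [tokenize(d) for d in docs]

# Analyzer-like full pass: keep the TOP_K most frequent tokens.
counts = Counter(t for doc in tokenized for t in doc)
vocab = {tok: i for i, (tok, _) in enumerate(counts.most_common(TOP_K))}

# Tokens outside the vocabulary get -1, then fold into bucket TOP_K.
ids = [[vocab.get(t, -1) for t in doc] for doc in tokenized]
folded = [[i % (TOP_K + 1) for i in doc] for doc in ids]

weighted = tfidf(folded)
for indices, weights in weighted:
    print(indices, [round(w, 3) for w in weights])
```

In this toy corpus "film" appears in every review, so its IDF is 1 and its weight equals its term frequency, while rarer tokens get a larger weight.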


### Transform the data
Now we're ready to start transforming our data in an Apache Beam pipeline.

1. Read in the data using the CSV reader
1. Transform it using a preprocessing pipeline that splits each review into tokens, maps the tokens to int64 indices by creating a vocabulary, and computes a TF-IDF weight for each token
1. Write out the result as a `TFRecord` of `Example` protos, which we will use for training a model later

<aside class="key-term"><b>Key Term:</b> <a target='_blank' href='https://beam.apache.org/'>Apache Beam</a> uses a <a target='_blank' href='https://beam.apache.org/documentation/programming-guide/#applying-transforms'>special syntax to define and invoke transforms</a>.  For example, in this line:

<code><blockquote>result = pass_this | 'name this step' >> to_this_call</blockquote></code>

The method <code>to_this_call</code> is being invoked and passed the object called <code>pass_this</code>, and <a target='_blank' href='https://stackoverflow.com/questions/50519662/what-does-the-redirection-mean-in-apache-beam-python'>this operation will be referred to as <code>name this step</code> in a stack trace</a>.  The result of the call to <code>to_this_call</code> is returned in <code>result</code>.  You will often see stages of a pipeline chained together like this:

<code><blockquote>result = apache_beam.Pipeline() | 'first step' >> do_this_first() | 'second step' >> do_this_last()</blockquote></code>

and since that started with a new pipeline, you can continue like this:

<code><blockquote>next_result = result | 'doing more stuff' >> another_function()</blockquote></code></aside>
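
The `|` and `>>` syntax is ordinary Python operator overloading. The toy classes below show one way such a syntax can be built; `Step` and `Value` are invented for this sketch and are not Beam classes. Unlike Beam, which only builds a deferred pipeline graph, this toy version applies each step immediately.

```python
class Step:
    """A toy transform: a function plus an optional label."""

    def __init__(self, fn, label=None):
        self.fn = fn
        self.label = label

    def __rrshift__(self, label):
        # Called for `'some label' >> step`, because str does not define `>>`.
        return Step(self.fn, label)


class Value:
    """A toy stand-in for a pipeline or collection of items."""

    def __init__(self, items, history=()):
        self.items = list(items)
        self.history = tuple(history)

    def __or__(self, step):
        # Called for `value | step`; applies the step and records its label.
        return Value(step.fn(self.items), self.history + (step.label,))


def double_all(items):
    return [x * 2 for x in items]


def keep_big(items):
    return [x for x in items if x > 2]


# `>>` binds tighter than `|`, so each label attaches to its step first.
result = Value([1, 2, 3]) | 'double' >> Step(double_all) | 'keep big' >> Step(keep_big)
print(result.items)
print(result.history)
```

The recorded labels play the role of the step names that Beam shows in stack traces and pipeline graphs.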

In [10]:
def transform_data(train_data_file, test_data_file, working_dir):
  """Transform the data and write out as a TFRecord of Example protos.

  Read in the data using the CSV reader, and transform it using
  a preprocessing pipeline that removes punctuation, tokenizes,
  and maps tokens to int64 indices.

  Args:
    train_data_file: File containing training data
    test_data_file: File containing test data
    working_dir: Directory to write transformed data and metadata to
  """

  # The "with" block will create a pipeline, and run that pipeline at the exit
  # of the block.
  with beam.Pipeline() as pipeline:
    with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
      # Create a coder to read the imdb data with the schema.  To do this we
      # need to list all columns in order since the schema doesn't specify the
      # order of columns in the csv.
      ordered_columns = ['text', 'label']
      converter = tft.coders.CsvCoder(ordered_columns, RAW_DATA_METADATA.schema)

      # Read in raw data and convert using CSV converter.  Note that we apply
      # some Beam transformations here, which will not be encoded in the TF
      # graph since we don't do them from within tf.Transform's methods
      # (AnalyzeDataset, TransformDataset etc.).  These transformations are just
      # to get data into a format that the CSV converter can read, in particular
      # removing spaces after commas.
      raw_data = (
          pipeline
          | 'ReadTrainData' >> beam.io.ReadFromText(train_data_file, skip_header_lines=1)
          | 'FixCommasTrainData' >> beam.Map(
              lambda line: line.replace(', ', ','))
          | 'DecodeTrainData' >> beam.Map(converter.decode)
      )

      # Combine data and schema into a dataset tuple.  Note that we already used
      # the schema to read the CSV data, but we also need it to interpret
      # raw_data.
      raw_dataset = (raw_data, RAW_DATA_METADATA)
      transformed_dataset, transform_fn = (
          raw_dataset | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
      transformed_data, transformed_metadata = transformed_dataset
      transformed_data_coder = tft.coders.ExampleProtoCoder(
          transformed_metadata.schema)

      _ = (
          transformed_data
          | 'EncodeTrainData' >> beam.Map(transformed_data_coder.encode)
          | 'WriteTrainData' >> beam.io.WriteToTFRecord(
              os.path.join(working_dir, TRANSFORMED_TRAIN_DATA_FILEBASE)))

      # Now apply transform function to test data.  As with the training data,
      # we skip the header line that is present in the test data file.
      raw_test_data = (
          pipeline
          | 'ReadTestData' >> beam.io.ReadFromText(test_data_file,
                                                   skip_header_lines=1)
          | 'FixCommasTestData' >> beam.Map(
              lambda line: line.replace(', ', ','))
          | 'DecodeTestData' >> beam.Map(converter.decode)
      )

      raw_test_dataset = (raw_test_data, RAW_DATA_METADATA)

      transformed_test_dataset = (
          (raw_test_dataset, transform_fn) | tft_beam.TransformDataset())
      # Don't need transformed data schema, it's the same as before.
      transformed_test_data, _ = transformed_test_dataset

      _ = (
          transformed_test_data
          | 'EncodeTestData' >> beam.Map(transformed_data_coder.encode)
          | 'WriteTestData' >> beam.io.WriteToTFRecord(
              os.path.join(working_dir, TRANSFORMED_TEST_DATA_FILEBASE)))

      # Will write a SavedModel and metadata to working_dir, which can then
      # be read by the tft.TFTransformOutput class.
      _ = (
          transform_fn
          | 'WriteTransformFn' >> tft_beam.WriteTransformFn(working_dir))
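
The `DecodeTrainData` and `DecodeTestData` steps above turn each CSV line into a dictionary keyed by column name. The sketch below imitates that idea with the standard `csv` module. It is a stand-in written for illustration, not how `tft.coders.CsvCoder` is implemented, and `decode_line`, `ORDERED_COLUMNS`, and `COLUMN_TYPES` are hypothetical names.

```python
import csv

ORDERED_COLUMNS = ['text', 'label']          # column order in the CSV file
COLUMN_TYPES = {'text': str, 'label': int}   # mirrors tf.string and tf.int64


def decode_line(line):
    """Parse one CSV line into a dict keyed by column name."""
    values = next(csv.reader([line]))
    if len(values) != len(ORDERED_COLUMNS):
        raise ValueError(
            'expected %d columns, got %d' % (len(ORDERED_COLUMNS), len(values)))
    return {
        name: COLUMN_TYPES[name](value)
        for name, value in zip(ORDERED_COLUMNS, values)
    }


# A quoted review may contain commas without breaking the column count.
row = decode_line('"A great, moving film.",1')
print(row)
```

Note that the `FixCommas` steps in `transform_data` run on the whole line, so they also rewrite `', '` inside the review text itself before decoding.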

## Using our preprocessed data to train a model

To show how `tf.Transform` enables us to use the same code for both training and serving, and thus prevent skew, we're going to train a model.  To train our model and prepare our trained model for production we need to create input functions.  The main difference between our training input function and our serving input function is that training data contains the labels, and production data does not.  The arguments and returns are also somewhat different.

### Create an input function for training

In [11]:
def _make_training_input_fn(tf_transform_output, transformed_examples,
                            batch_size):
  """Creates an input function reading from transformed data.

  Args:
    tf_transform_output: Wrapper around output of tf.Transform.
    transformed_examples: Base filename of examples.
    batch_size: Batch size.

  Returns:
    The input function for training or eval.
  """
  def input_fn():
    """Input function for training and eval."""
    dataset = tf.data.experimental.make_batched_features_dataset(
        file_pattern=transformed_examples,
        batch_size=batch_size,
        features=tf_transform_output.transformed_feature_spec(),
        reader=tf.data.TFRecordDataset,
        shuffle=True)

    transformed_features = tf.compat.v1.data.make_one_shot_iterator(
        dataset).get_next()

    # Extract features and label from the transformed tensors.
    transformed_labels = transformed_features.pop(LABEL_KEY)

    return transformed_features, transformed_labels

  return input_fn

### Create an input function for serving

Let's create an input function that we could use in production, and prepare our trained model for serving.

In [12]:
def _make_serving_input_fn(tf_transform_output):
  """Creates an input function reading from raw data.

  Args:
    tf_transform_output: Wrapper around output of tf.Transform.

  Returns:
    The serving input function.
  """
  raw_feature_spec = RAW_DATA_FEATURE_SPEC.copy()
  # Remove label since it is not available during serving.
  raw_feature_spec.pop(LABEL_KEY)

  def serving_input_fn():
    """Input function for serving."""
    # Get raw features by generating the basic serving input_fn and calling it.
    # Here we generate an input_fn that expects a parsed Example proto to be fed
    # to the model at serving time.  See also
    # tf.estimator.export.build_raw_serving_input_receiver_fn.
    raw_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(
        raw_feature_spec, default_batch_size=None)
    serving_input_receiver = raw_input_fn()

    # Apply the transform function that was used to generate the materialized
    # data.
    raw_features = serving_input_receiver.features
    transformed_features = tf_transform_output.transform_raw_features(
        raw_features)

    return tf.estimator.export.ServingInputReceiver(
        transformed_features, serving_input_receiver.receiver_tensors)

  return serving_input_fn

### Wrap our input data in FeatureColumns
Our model will expect our data in TensorFlow FeatureColumns.

In [13]:
def get_feature_columns(tf_transform_output):
  """Returns the FeatureColumns for the model.
  Args:
    tf_transform_output: A `TFTransformOutput` object.
  Returns:
    A list of FeatureColumns.
  """
  del tf_transform_output  # unused
  # Unrecognized tokens are represented by -1, but
  # categorical_column_with_identity uses the mod operator to map integers
  # to the range [0, bucket_size).  By choosing bucket_size=VOCAB_SIZE + 1, we
  # represent unrecognized tokens as VOCAB_SIZE.
  review_column = tf.feature_column.categorical_column_with_identity(
      REVIEW_KEY, num_buckets=VOCAB_SIZE + 1)
  weighted_reviews = tf.feature_column.weighted_categorical_column(
      review_column, REVIEW_WEIGHT_KEY)

  return [weighted_reviews]

## Train, Evaluate, and Export our model

In [14]:
def train_and_evaluate(working_dir, num_train_instances=NUM_TRAIN_INSTANCES,
                       num_test_instances=NUM_TEST_INSTANCES):
  """Train the model on training data and evaluate on test data.

  Args:
    working_dir: Directory to read transformed data and metadata from and to
        write exported model to.
    num_train_instances: Number of instances in train set
    num_test_instances: Number of instances in test set

  Returns:
    The results from the estimator's 'evaluate' method
  """
  tf_transform_output = tft.TFTransformOutput(working_dir)

  run_config = tf.estimator.RunConfig()

  estimator = tf.estimator.LinearClassifier(
      feature_columns=get_feature_columns(tf_transform_output),
      config=run_config,
      loss_reduction=tf.losses.Reduction.SUM)

  # Fit the model using the default optimizer.
  train_input_fn = _make_training_input_fn(
      tf_transform_output,
      os.path.join(working_dir, TRANSFORMED_TRAIN_DATA_FILEBASE + '*'),
      batch_size=TRAIN_BATCH_SIZE)
  estimator.train(
      input_fn=train_input_fn,
      max_steps=TRAIN_NUM_EPOCHS * num_train_instances / TRAIN_BATCH_SIZE)

  # Evaluate model on test dataset.
  eval_input_fn = _make_training_input_fn(
      tf_transform_output,
      os.path.join(working_dir, TRANSFORMED_TEST_DATA_FILEBASE + '*'),
      batch_size=1)

  # Export the model.
  serving_input_fn = _make_serving_input_fn(tf_transform_output)
  exported_model_dir = os.path.join(working_dir, EXPORTED_MODEL_DIR)
  estimator.export_saved_model(exported_model_dir, serving_input_fn)

  return estimator.evaluate(input_fn=eval_input_fn, steps=num_test_instances)
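
The step counts used in `train_and_evaluate` follow from simple arithmetic on the constants defined earlier. The short calculation below reproduces it outside TensorFlow; rounding up with `math.ceil` is only one possible choice and is not what the notebook does, which passes the unrounded value.

```python
import math

TRAIN_NUM_EPOCHS = 16
NUM_TRAIN_INSTANCES = 32561
TRAIN_BATCH_SIZE = 128
NUM_TEST_INSTANCES = 16281

# One epoch is one pass over the training set, in batches.
steps_per_epoch = NUM_TRAIN_INSTANCES / TRAIN_BATCH_SIZE

# Same expression as the max_steps argument in train_and_evaluate.
raw_max_steps = TRAIN_NUM_EPOCHS * NUM_TRAIN_INSTANCES / TRAIN_BATCH_SIZE

# Optional: round up to a whole number of steps.
max_steps = math.ceil(raw_max_steps)

# Evaluation uses batch_size=1, so steps equals the number of examples read.
eval_examples = NUM_TEST_INSTANCES * 1

print(steps_per_epoch, raw_max_steps, max_steps, eval_examples)
```

With these constants the training expression is not a whole number, so if you change the batch size or instance counts it is worth checking what value actually reaches `max_steps`.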

## Put it all together
We've created everything we need to preprocess our IMDB data, train a model, and prepare it for serving.  So far we've just been getting things ready.  It's time to start running!

Note: Scroll the output from this cell to see the whole process.  The results will be at the bottom.

In [15]:
import tempfile
temp = tempfile.gettempdir()

transform_data(train, test, "")
results = train_and_evaluate("")
pprint.pprint(results)











Instructions for updating:
This function will only be available through the v1 compatibility library as tf.compat.v1.saved_model.utils.build_tensor_info or tf.compat.v1.saved_model.build_tensor_info.
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
'Counter' object has no attribute 'name'
INFO:tensorflow:SavedModel written to: C:\Users\the\AppData\Local\Temp\tmpdagu9i60\tftransform_tmp\2c241122c30847f290780b6ab9ff58ec\saved_model.pb
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
'Counter' object has no attribute 'name'
INFO:tensorflow:SavedModel written to: C:\Users\the\AppData\Local\Temp\tmpdagu9i60\tftransform_tmp\2897f5dbb50a46a4b141fac50c08c0dd\saved_model.pb
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
'Counter' object has no attribute 'name'
INFO:tensorflow:SavedModel written to: C:\Users\the\AppData\Local\Temp\tmpdagu9i60\tftransform_tmp\26512e211c7a442e9bfa16d67111976e\saved_model.pb








  name: "label"
  type: INT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "text"
  type: BYTES
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
}), (<PCollection[AnalyzeAndTransformDataset/AnalyzeDataset/CreateSavedModel/BindTensors/ReplaceWithConstants.None] at 0x193476de848>, BeamDatasetMetadata(dataset_metadata={'_schema': feature {
  name: "label"
  type: INT
  presence {
    min_fraction: 1.0
  }
  shape {
  }
}
feature {
  name: "review_weight"
  type: FLOAT
}
feature {
  name: "text"
  type: INT
}
}, deferred_metadata=<PCollection[AnalyzeAndTransformDataset/AnalyzeDataset/ComputeDeferredMetadata.None] at 0x194d3fcf608>))) belongs to. Thus noop.


INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:Assets written to: C:\Users\the\AppData\Local\Temp\tmpdagu9i60\tftransform_tmp\35283e82bcef484aaf5f43d9aebfc2bc\assets
INFO:tensorflow:SavedModel written to: C:\Users\the\AppData\Local\Temp\tmpdagu9i60\tftransform_tmp\35283e82bcef484aaf5f43d9aebfc2bc\saved_model.pb

  tensor_info {
    name: "Const_1:0"
  }
  filename: "vocab_compute_and_apply_vocabulary_vocabulary"
}

INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:Assets written to: C:\Users\the\AppData\Local\Temp\tmpdagu9i60\tftransform_tmp\7d943ea3f4c54507a4bf2fea145855f0\assets
INFO:tensorflow:SavedModel written to: C:\Users\the\AppData\Local\Temp\tmpdagu9i60\tftransform_tmp\7d943ea3f4c54507a4bf2fea145855f0\saved_model.pb

  tensor_info {
    name: "Const_1:0"
  }
  filename: "vocab_compute_and_apply_vocabulary_vocabulary"
}

INFO:tensorflow:Saver not created because there are no variables in the graph to restore

  tensor_info {
    name: "Const_1:0"
  }
  filename: "vocab_compute_and_apply_vocabulary_vocabulary"
}

INFO:tensorflow:Saver not created because there are no variables in the graph to restore

  tensor_info {
    name: "Const_1:0"
  }
  filename: "vocab_compute_and_apply_vocabulary_vocabulary"
}

INFO:tensorflow:Saver not created because there are no variables in the graph to restore






INFO:tensorflow:Using config: {'_model_dir': 'C:\\Users\\the\\AppData\\Local\\Temp\\tmpymyz76xi', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}




Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
INFO:tensorflow:Calling model_fn.
Instructions for updating:
Please use `layer.add_weight` method instead.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into C:\Users\the\AppData\Local\Temp\tmpymyz76xi\model.ckpt.


INFO:tensorflow:loss = 88.72284, step = 0


INFO:tensorflow:loss = 88.72284, step = 0


INFO:tensorflow:global_step/sec: 94.102


INFO:tensorflow:global_step/sec: 94.102


INFO:tensorflow:loss = 75.59761, step = 100 (1.065 sec)


INFO:tensorflow:loss = 75.59761, step = 100 (1.065 sec)


INFO:tensorflow:global_step/sec: 107.028


INFO:tensorflow:global_step/sec: 107.028


INFO:tensorflow:loss = 65.89246, step = 200 (0.935 sec)


INFO:tensorflow:loss = 65.89246, step = 200 (0.935 sec)


INFO:tensorflow:global_step/sec: 107.576


INFO:tensorflow:global_step/sec: 107.576


INFO:tensorflow:loss = 61.948456, step = 300 (0.929 sec)


INFO:tensorflow:loss = 61.948456, step = 300 (0.929 sec)


INFO:tensorflow:global_step/sec: 103.643


INFO:tensorflow:global_step/sec: 103.643


INFO:tensorflow:loss = 57.806564, step = 400 (0.966 sec)


INFO:tensorflow:loss = 57.806564, step = 400 (0.966 sec)


INFO:tensorflow:global_step/sec: 106.313


INFO:tensorflow:global_step/sec: 106.313


INFO:tensorflow:loss = 55.763664, step = 500 (0.940 sec)


INFO:tensorflow:loss = 55.763664, step = 500 (0.940 sec)


INFO:tensorflow:global_step/sec: 112.736


INFO:tensorflow:global_step/sec: 112.736


INFO:tensorflow:loss = 56.44605, step = 600 (0.888 sec)


INFO:tensorflow:loss = 56.44605, step = 600 (0.888 sec)


INFO:tensorflow:global_step/sec: 116.148


INFO:tensorflow:global_step/sec: 116.148


INFO:tensorflow:loss = 54.172916, step = 700 (0.860 sec)


INFO:tensorflow:loss = 54.172916, step = 700 (0.860 sec)


INFO:tensorflow:global_step/sec: 114.445


INFO:tensorflow:global_step/sec: 114.445


INFO:tensorflow:loss = 50.458836, step = 800 (0.874 sec)


INFO:tensorflow:loss = 50.458836, step = 800 (0.874 sec)


INFO:tensorflow:global_step/sec: 116.099


INFO:tensorflow:global_step/sec: 116.099


INFO:tensorflow:loss = 54.180107, step = 900 (0.861 sec)


INFO:tensorflow:loss = 54.180107, step = 900 (0.861 sec)


INFO:tensorflow:global_step/sec: 113.178


INFO:tensorflow:global_step/sec: 113.178


INFO:tensorflow:loss = 53.16276, step = 1000 (0.884 sec)


INFO:tensorflow:loss = 53.16276, step = 1000 (0.884 sec)


INFO:tensorflow:global_step/sec: 114.547


INFO:tensorflow:global_step/sec: 114.547


INFO:tensorflow:loss = 48.398293, step = 1100 (0.873 sec)


INFO:tensorflow:loss = 48.398293, step = 1100 (0.873 sec)


INFO:tensorflow:global_step/sec: 115.784


INFO:tensorflow:global_step/sec: 115.784


INFO:tensorflow:loss = 44.181213, step = 1200 (0.865 sec)


INFO:tensorflow:loss = 44.181213, step = 1200 (0.865 sec)


INFO:tensorflow:global_step/sec: 116.959


INFO:tensorflow:global_step/sec: 116.959


INFO:tensorflow:loss = 45.790894, step = 1300 (0.853 sec)


INFO:tensorflow:global_step/sec: 115.873


INFO:tensorflow:loss = 47.93828, step = 1400 (0.864 sec)


INFO:tensorflow:global_step/sec: 113.803


INFO:tensorflow:loss = 41.588318, step = 1500 (0.880 sec)


INFO:tensorflow:global_step/sec: 113.809


INFO:tensorflow:loss = 52.57949, step = 1600 (0.878 sec)


INFO:tensorflow:global_step/sec: 114.566


INFO:tensorflow:loss = 45.040047, step = 1700 (0.874 sec)


INFO:tensorflow:global_step/sec: 115.917


INFO:tensorflow:loss = 45.69735, step = 1800 (0.862 sec)


INFO:tensorflow:global_step/sec: 114.684


INFO:tensorflow:loss = 38.66687, step = 1900 (0.872 sec)


INFO:tensorflow:global_step/sec: 117.024


INFO:tensorflow:loss = 43.128967, step = 2000 (0.856 sec)


INFO:tensorflow:global_step/sec: 114.688


INFO:tensorflow:loss = 37.956184, step = 2100 (0.871 sec)


INFO:tensorflow:global_step/sec: 109.213


INFO:tensorflow:loss = 41.78333, step = 2200 (0.916 sec)


INFO:tensorflow:global_step/sec: 113.234


INFO:tensorflow:loss = 38.69662, step = 2300 (0.883 sec)


INFO:tensorflow:global_step/sec: 115.054


INFO:tensorflow:loss = 42.41269, step = 2400 (0.869 sec)


INFO:tensorflow:global_step/sec: 115.343


INFO:tensorflow:loss = 38.093376, step = 2500 (0.867 sec)


INFO:tensorflow:global_step/sec: 114.813


INFO:tensorflow:loss = 36.024292, step = 2600 (0.871 sec)


INFO:tensorflow:global_step/sec: 117.152


INFO:tensorflow:loss = 41.055367, step = 2700 (0.854 sec)


INFO:tensorflow:global_step/sec: 117.379


INFO:tensorflow:loss = 36.75254, step = 2800 (0.852 sec)


INFO:tensorflow:global_step/sec: 108.46


INFO:tensorflow:loss = 50.256046, step = 2900 (0.922 sec)


INFO:tensorflow:global_step/sec: 115.379


INFO:tensorflow:loss = 34.87541, step = 3000 (0.867 sec)


INFO:tensorflow:global_step/sec: 113.159


INFO:tensorflow:loss = 42.128975, step = 3100 (0.885 sec)


INFO:tensorflow:global_step/sec: 116.502


INFO:tensorflow:loss = 38.17366, step = 3200 (0.857 sec)


INFO:tensorflow:global_step/sec: 116.279


INFO:tensorflow:loss = 39.303467, step = 3300 (0.860 sec)


INFO:tensorflow:global_step/sec: 115.125


INFO:tensorflow:loss = 42.635345, step = 3400 (0.869 sec)


INFO:tensorflow:global_step/sec: 116.685


INFO:tensorflow:loss = 35.69165, step = 3500 (0.857 sec)


INFO:tensorflow:global_step/sec: 117.094


INFO:tensorflow:loss = 30.901636, step = 3600 (0.854 sec)


INFO:tensorflow:global_step/sec: 107.088


INFO:tensorflow:loss = 34.713566, step = 3700 (0.934 sec)


INFO:tensorflow:global_step/sec: 115.205


INFO:tensorflow:loss = 36.199425, step = 3800 (0.868 sec)


INFO:tensorflow:global_step/sec: 114.683


INFO:tensorflow:loss = 32.81576, step = 3900 (0.873 sec)


INFO:tensorflow:global_step/sec: 111.982


INFO:tensorflow:loss = 30.346886, step = 4000 (0.892 sec)


INFO:tensorflow:Saving checkpoints for 4071 into C:\Users\the\AppData\Local\Temp\tmpymyz76xi\model.ckpt.


INFO:tensorflow:Loss for final step: 36.533607.


  tensor_info {
    name: "Const_1:0"
  }
  filename: "vocab_compute_and_apply_vocabulary_vocabulary"
}



INFO:tensorflow:Saver not created because there are no variables in the graph to restore


INFO:tensorflow:Calling model_fn.


INFO:tensorflow:Done calling model_fn.


INFO:tensorflow:Signatures INCLUDED in export for Classify: ['serving_default', 'classification']


INFO:tensorflow:Signatures INCLUDED in export for Regress: ['regression']


INFO:tensorflow:Signatures INCLUDED in export for Predict: ['predict']


INFO:tensorflow:Signatures INCLUDED in export for Train: None


INFO:tensorflow:Signatures INCLUDED in export for Eval: None


INFO:tensorflow:Restoring parameters from C:\Users\the\AppData\Local\Temp\tmpymyz76xi\model.ckpt-4071


INFO:tensorflow:Assets added to graph.


INFO:tensorflow:Assets written to: exported_model_dir\temp-b'1591341342'\assets


INFO:tensorflow:SavedModel written to: exported_model_dir\temp-b'1591341342'\saved_model.pb


INFO:tensorflow:Calling model_fn.


INFO:tensorflow:Done calling model_fn.


INFO:tensorflow:Starting evaluation at 2020-06-05T09:15:43Z


INFO:tensorflow:Graph was finalized.


INFO:tensorflow:Restoring parameters from C:\Users\the\AppData\Local\Temp\tmpymyz76xi\model.ckpt-4071


INFO:tensorflow:Running local_init_op.


INFO:tensorflow:Done running local_init_op.


INFO:tensorflow:Evaluation [1628/16281]


INFO:tensorflow:Evaluation [3256/16281]


INFO:tensorflow:Evaluation [4884/16281]


INFO:tensorflow:Evaluation [6512/16281]


INFO:tensorflow:Evaluation [8140/16281]


INFO:tensorflow:Evaluation [9768/16281]


INFO:tensorflow:Evaluation [11396/16281]


INFO:tensorflow:Evaluation [13024/16281]


INFO:tensorflow:Evaluation [14652/16281]


INFO:tensorflow:Evaluation [16280/16281]


INFO:tensorflow:Evaluation [16281/16281]


INFO:tensorflow:Inference Time : 91.83177s


INFO:tensorflow:Finished evaluation at 2020-06-05-09:17:15


INFO:tensorflow:Saving dict for global step 4071: accuracy = 0.9137031, accuracy_baseline = 0.5011363, auc = 0.97100824, auc_precision_recall = 0.9701615, average_loss = 0.28369084, global_step = 4071, label/mean = 0.5011363, loss = 0.28369084, precision = 0.91152817, prediction/mean = 0.4994619, recall = 0.91677904


INFO:tensorflow:Saving 'checkpoint_path' summary for global step 4071: C:\Users\the\AppData\Local\Temp\tmpymyz76xi\model.ckpt-4071


{'accuracy': 0.9137031,
 'accuracy_baseline': 0.5011363,
 'auc': 0.97100824,
 'auc_precision_recall': 0.9701615,
 'average_loss': 0.28369084,
 'global_step': 4071,
 'label/mean': 0.5011363,
 'loss': 0.28369084,
 'precision': 0.91152817,
 'prediction/mean': 0.4994619,
 'recall': 0.91677904}


## What we did  

In this example we used `tf.Transform` to preprocess a dataset of IMDB movie reviews and trained a model on the cleaned and transformed data.  We also created an input function that we can use when we deploy the trained model in a production environment to perform inference.  Because the same code runs for both training and inference, we avoid training/serving skew.  Along the way we learned about creating an Apache Beam transform to perform the transformation that we needed for cleaning the data, and wrapped our data in TensorFlow `FeatureColumns`.  This is just a small piece of what TensorFlow Transform can do!  We encourage you to dive into `tf.Transform` and discover what it can do for you.

## Serving The Model

The model can be served with TensorFlow Serving running in Docker: https://www.tensorflow.org/tfx/serving/docker
This way you can serve the model through a RESTful interface or through a gRPC interface.

The greatest benefit of using TFX and TensorFlow Serving, from my point of view, is that the transformations become part of the compute graph. This means that when serving you don't need to keep track of the word tokenizer or your category encodings: you call the API with text, and you get back the category names.

### Preparing
I tested this on my local Windows machine, where you need to:
1. Install Docker: https://docs.docker.com/docker-for-windows/install/
1. Make sure you have shared the path where your model is saved (I moved mine to `c:/temp`):  
    Open Docker Desktop  
    Go to Settings  
    Go to Resources  
    Go to File Sharing  
    Add the path to your model  
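
Before mounting the folder, it can help to confirm that it has the layout the serving container looks for: one numbered sub-folder per model version, each holding a `saved_model.pb`. The following is a minimal sketch using only the Python standard library. The helper name `find_model_versions` is made up for this illustration, and the path in the usage comment is an assumption based on the `c:/temp` location mentioned above.

```python
from pathlib import Path


def find_model_versions(model_dir):
    """Return the numeric version folders under model_dir that hold a saved_model.pb."""
    versions = []
    for child in Path(model_dir).iterdir():
        name = child.name
        # TensorFlow Serving only loads sub-folders whose name is a number;
        # leftover folders such as "temp-..." are skipped here as well.
        is_version = name.isascii() and name.isdigit()
        if child.is_dir() and is_version and (child / "saved_model.pb").is_file():
            versions.append(int(name))
    return sorted(versions)


# Usage (the path is an assumption, adjust it to where you moved the export):
# print(find_model_versions("c:/temp/exported_model_dir"))
```

An empty list suggests that the export did not finish or that the wrong folder is about to be shared.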

### Serving
Run the serving container with the model mounted. TensorFlow Serving picks up the model and serves it on a REST endpoint at port 8501:

`docker run -p 8501:8501 --mount type=bind,source=c:\temp\exported_model_dir,target=/models/imdb -e MODEL_NAME=imdb -t tensorflow/serving &`
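
Once the container is up, one way to check that the model was loaded is to query the model status endpoint of the TensorFlow Serving REST API, which lives at `/v1/models/<model name>`. The sketch below uses only the standard library; the function names are made up for this illustration, and the host, port and model name are taken from the `docker run` command above.

```python
import json
import urllib.request


def model_status_url(model_name, host="localhost", port=8501):
    """Build the URL of the model status endpoint."""
    return f"http://{host}:{port}/v1/models/{model_name}"


def get_model_status(model_name, host="localhost", port=8501, timeout=5.0):
    """Fetch and decode the status JSON.

    Raises urllib.error.URLError if the server cannot be reached.
    """
    url = model_status_url(model_name, host, port)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage once the container is running:
# print(get_model_status("imdb"))
```

If the model loaded correctly, the response should list the model version with the state `AVAILABLE`.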

### Classifying
Send a POST request to the endpoint http://localhost:8501/v1/models/imdb:classify

with the payload
`{"examples":[{"text":"greatest movie i ever seen"}]}`

I use Postman to send the request.
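
The same request can also be sent from Python. This is a minimal sketch using only the standard library, not part of the original notebook: the function names are made up for this illustration, and `best_labels` assumes the response has the shape documented for the classify API, `{"results": [[[label, score], ...], ...]}`. Which label strings come back depends on the exported model.

```python
import json
import urllib.request

CLASSIFY_URL = "http://localhost:8501/v1/models/imdb:classify"


def build_classify_request(texts, url=CLASSIFY_URL):
    """Build a POST request carrying one example per input text."""
    payload = {"examples": [{"text": t} for t in texts]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(texts, url=CLASSIFY_URL, timeout=10.0):
    """Send the request and return the decoded JSON response."""
    request = build_classify_request(texts, url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def best_labels(response):
    """Pick the highest-scoring label for each example in the response."""
    return [max(pairs, key=lambda pair: pair[1])[0] for pairs in response["results"]]


# Usage once the container is running:
# print(best_labels(classify(["greatest movie i ever seen"])))
```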