# TensorFlow Estimators: Train, Evaluate, Export, Explained!

The purpose of this tutorial is to explain how TensorFlow estimator training and evaluation work under different configurations, and how the model is exported for serving. The tutorial covers the following points:
1. Creating an estimator using the premade **DNNClassifier** for the Census dataset.
2. Parameterizing **input function for training**.
3. Training with **incremental steps vs. total steps**.
4. Training with **steps vs. epochs**.
5. Controlling **checkpoint** creation frequency.
6. Parameterizing **input function for evaluation**.
7. **Intertwining** training and evaluation.
8. Creating a **serving input function** for exporting the model.
9. Configuring the **latest exporter**, **final exporter**, and **best exporter**.

<a href="https://colab.research.google.com/github/GoogleCloudPlatform/tf-estimator-tutorials/blob/master/00_Miscellaneous/tf_train_eval_export/Tutorial%20-%20TensorFlow%20Estimator%20Train%2C%20Evaluate%2C%20Export%2C%20Explained.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import os
import pandas as pd
import numpy as np
from datetime import datetime

import tensorflow as tf
from tensorflow import data

print("TensorFlow : {}".format(tf.__version__))

SEED = 19831060

TensorFlow : 1.12.0


## Download Data

In [7]:
DATA_DIR='data'
!mkdir $DATA_DIR
!gsutil cp gs://cloud-samples-data/ml-engine/census/data/adult.data.csv $DATA_DIR
!gsutil cp gs://cloud-samples-data/ml-engine/census/data/adult.test.csv $DATA_DIR

mkdir: data: File exists
Copying gs://cloud-samples-data/ml-engine/census/data/adult.data.csv...
- [1 files][  3.8 MiB/  3.8 MiB]                                                
Operation completed over 1 objects/3.8 MiB.                                      
Copying gs://cloud-samples-data/ml-engine/census/data/adult.test.csv...
- [1 files][  1.9 MiB/  1.9 MiB]                                                
Operation completed over 1 objects/1.9 MiB.                                      


In [8]:
TRAIN_DATA_FILE = os.path.join(DATA_DIR, 'adult.data.csv')
EVAL_DATA_FILE = os.path.join(DATA_DIR, 'adult.test.csv')

In [19]:
!wc -l $TRAIN_DATA_FILE
!wc -l $EVAL_DATA_FILE

   32561 data/adult.data.csv
   16278 data/adult.test.csv


The training data includes 32561 records, while the evaluation data includes 16278 records (one record per line, as the files have no header row). We will fix the batch size at 200.

In [20]:
TRAIN_DATA_SIZE = 32561
EVAL_DATA_SIZE = 16278
BATCH_SIZE = 200

print("Training data size:{}".format(TRAIN_DATA_SIZE))
print("Number of batches in training data: {}".format(TRAIN_DATA_SIZE/float(BATCH_SIZE)))
print("Evaluation data size:{}".format(EVAL_DATA_SIZE))
print("Number of batches in evaluation data: {}".format(EVAL_DATA_SIZE/float(BATCH_SIZE)))

Training data size:32561
Number of batches in training data: 162.805
Evaluation data size:16278
Number of batches in evaluation data: 81.39
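
The fractional batch counts above mean that the last batch of each pass over the data is smaller than the batch size. The following is a small sketch of that arithmetic, assuming the final partial batch is kept rather than dropped. The helper name `batches_per_epoch` is hypothetical and not part of the tutorial or of TensorFlow.

```python
def batches_per_epoch(data_size, batch_size):
    """Return (number of batches, size of the last batch) for one pass over the data.

    Assumes the final partial batch is kept, not dropped.
    """
    full_batches, remainder = divmod(data_size, batch_size)
    if remainder == 0:
        return full_batches, batch_size
    return full_batches + 1, remainder


# Sizes taken from the cell above.
print(batches_per_epoch(32561, 200))  # (163, 161): 162 full batches plus one of 161 records
print(batches_per_epoch(16278, 200))  # (82, 78): 81 full batches plus one of 78 records
```

So one pass over the training data takes 163 steps, and one pass over the evaluation data takes 82 steps.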


## Dataset Metadata

In [21]:
HEADER = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
               'marital_status', 'occupation', 'relationship', 'race', 'gender',
               'capital_gain', 'capital_loss', 'hours_per_week',
               'native_country', 'income_bracket']

HEADER_DEFAULTS = [[0], [''], [0], [''], [0], [''], [''], [''], [''], [''],
                       [0], [0], [0], [''], ['']]

NUMERIC_FEATURE_NAMES = ['age', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']
CATEGORICAL_FEATURE_NAMES = ['gender', 'race', 'education', 'marital_status', 'relationship', 
                             'workclass', 'occupation', 'native_country']

FEATURE_NAMES = NUMERIC_FEATURE_NAMES + CATEGORICAL_FEATURE_NAMES
TARGET_NAME = 'income_bracket'
TARGET_LABELS = [' <=50K', ' >50K']
WEIGHT_COLUMN_NAME = 'fnlwgt'

def get_categorical_features_vocabolary():
    data = pd.read_csv(TRAIN_DATA_FILE, names=HEADER)
    return {
        column: list(data[column].unique()) 
        for column in data.columns if column in CATEGORICAL_FEATURE_NAMES
    }

In [22]:
feature_vocabolary = get_categorical_features_vocabolary()
print(feature_vocabolary)

{'workclass': [' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov', ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay', ' Never-worked'], 'relationship': [' Not-in-family', ' Husband', ' Wife', ' Own-child', ' Unmarried', ' Other-relative'], 'gender': [' Male', ' Female'], 'marital_status': [' Never-married', ' Married-civ-spouse', ' Divorced', ' Married-spouse-absent', ' Separated', ' Married-AF-spouse', ' Widowed'], 'race': [' White', ' Black', ' Asian-Pac-Islander', ' Amer-Indian-Eskimo', ' Other'], 'native_country': [' United-States', ' Cuba', ' Jamaica', ' India', ' ?', ' Mexico', ' South', ' Puerto-Rico', ' Honduras', ' England', ' Canada', ' Germany', ' Iran', ' Philippines', ' Italy', ' Poland', ' Columbia', ' Cambodia', ' Thailand', ' Ecuador', ' Laos', ' Taiwan', ' Haiti', ' Portugal', ' Dominican-Republic', ' El-Salvador', ' France', ' Guatemala', ' China', ' Japan', ' Yugoslavia', ' Peru', ' Outlying-US(Guam-USVI-etc)', ' Scotland', ' Trinadad&Tobago', ' Greece',

## Create Estimator
Note that the purpose of this tutorial is not to create a state-of-the-art model for this dataset. The purpose is to show the mechanisms of training and evaluating a TensorFlow estimator, regardless of the sophistication of the model or its predictive power. Thus, we use a premade tf.estimator.DNNClassifier for our examples. The ideas discussed in this tutorial also apply to more complex custom estimators.

In [23]:
import math

def create_feature_columns():
    
    feature_columns = []
    
    for column in NUMERIC_FEATURE_NAMES:
        feature_column = tf.feature_column.numeric_column(column)
        feature_columns.append(feature_column)
        
    for column in CATEGORICAL_FEATURE_NAMES:
        vocabolary = feature_vocabolary[column]
        embed_size = int(math.sqrt(len(vocabolary)))
        feature_column = tf.feature_column.embedding_column(
            tf.feature_column.categorical_column_with_vocabulary_list(column, vocabolary), 
            embed_size)
        feature_columns.append(feature_column)
        
    return feature_columns
                  

def create_estimator(run_config):
    
    feature_columns = create_feature_columns()
    
    estimator = tf.estimator.DNNClassifier(
        feature_columns=feature_columns,
        n_classes=len(TARGET_LABELS),
        label_vocabulary=TARGET_LABELS,
        weight_column=WEIGHT_COLUMN_NAME,
        hidden_units=[100, 70, 50] ,
        dropout=0.2,
        batch_norm=True,
        config=run_config
    )
    
    return estimator

MODELS_LOCATION = 'models/census'
MODEL_NAME = 'dnn_classifier'
model_dir = os.path.join(MODELS_LOCATION, MODEL_NAME)

print(model_dir)

run_config = tf.estimator.RunConfig(
    tf_random_seed=SEED,
    model_dir=model_dir
)

models/census/dnn_classifier


## Data Input Function

In [3]:
def make_input_fn(file_pattern, batch_size, num_epochs, shuffle=False):
    
    def _input_fn():
        dataset = tf.data.experimental.make_csv_dataset(
            file_pattern=file_pattern,
            batch_size=batch_size,
            column_names=HEADER,
            column_defaults=HEADER_DEFAULTS,
            label_name=TARGET_NAME,
            field_delim=',',
            use_quote_delim=True,
            header=False,
            num_epochs=num_epochs,
            shuffle=shuffle
        )
        return dataset
    
    return _input_fn

## Train: Input Function
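* **batch_size** is set.
* **num_epochs** is set to None, so the dataset repeats indefinitely and the number of training steps decides when training stops.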

Later we are going to see how to use epochs for training.


In [25]:
train_input_fn = make_input_fn(
    TRAIN_DATA_FILE,
    batch_size=BATCH_SIZE,
    num_epochs=None,
    shuffle=True
)

## Train: Incremental Steps vs. Total Steps
* 1 batch (feed forward pass & backpropagation) corresponds to 1 training step 
* **steps**: Number of steps for which to train model. 'steps' works **incrementally**. Two calls to train(steps=100) means 200 training iterations.
* **max_steps**: Number of **total** steps for which to train model. If set, steps must be None. Two calls to train(max_steps=100) means that the second call will not do any iteration since first call did all 100 steps.
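
The difference between the two parameters can be modelled in a few lines of plain Python. This is a toy model of the semantics described in the bullets above, not TensorFlow code. The function `steps_to_run` is a hypothetical name, and `global_step` stands for the step counter restored from the latest checkpoint.

```python
def steps_to_run(global_step, steps=None, max_steps=None):
    """Model how many iterations one train() call performs (simplified)."""
    if steps is not None and max_steps is not None:
        raise ValueError("Set only one of steps and max_steps.")
    if steps is not None:
        # Incremental: always run this many additional steps.
        return steps
    if max_steps is not None:
        # Total: run only what is left until the cap is reached.
        return max(0, max_steps - global_step)
    # Neither is set: training runs until the input is exhausted.
    return None


# Two calls with steps=100 accumulate.
global_step = 0
for _ in range(2):
    global_step += steps_to_run(global_step, steps=100)
print(global_step)  # 200

# Two calls with max_steps=100: the second call has nothing left to do.
global_step = 0
for _ in range(2):
    global_step += steps_to_run(global_step, max_steps=100)
print(global_step)  # 100
```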


In the following function, **clean_start** flag indicates whether to delete the previous model artefacts (if any), and **incremental** flag indicates whether to use **steps** (for incremental training steps) or **max_steps** (for overall training steps). 

In [26]:
def train_experiment(training_steps, clean_start, incremental, run_config):

    if clean_start:
        if tf.gfile.Exists(run_config.model_dir):
            print("Removing previous artefacts...")
            tf.gfile.DeleteRecursively(run_config.model_dir)

    print("")
    estimator = create_estimator(run_config)
    print("")
    
    time_start = datetime.utcnow() 
    print("Experiment started at {}".format(time_start.strftime("%H:%M:%S")))
    print(".......................................") 
   
    if incremental:
        # Use steps parameter
        estimator.train(train_input_fn, steps=training_steps)
    else:
        # Use max_steps parameter
        estimator.train(train_input_fn, max_steps=training_steps)
        
    time_end = datetime.utcnow() 
    print(".......................................")
    print("Experiment finished at {}".format(time_end.strftime("%H:%M:%S")))
    print("")
    time_elapsed = time_end - time_start
    print("Experiment elapsed time: {} seconds".format(time_elapsed.total_seconds()))
    
    return estimator

In [27]:
train_experiment(
    training_steps=1000, 
    clean_start=True,
    incremental=False,
    run_config=run_config
)

Removing previous artefacts...

INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x10ca58ed0>, '_model_dir': 'models/census/dnn_classifier', '_protocol': None, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': 19831060, '_save_summary_steps': 100, '_device_fn': None, '_experimental_distribute': None, '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_evaluation_master': '', '_eval_distribute': None, '_global_id_in_cluster': 0, '_master': ''}

Experiment started at 16:52:35
.......................................
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x105902610>

The total number of training steps is 1000.

Let's run this again with max_steps, without deleting the previous model.

In [87]:
train_experiment(
    training_steps=1000, 
    clean_start=False,
    incremental=False,
    run_config=run_config,
)


INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_global_id_in_cluster': 0, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': 19831006, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x11fb97150>, '_model_dir': 'models/census/dnn_classifier', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_service': None, '_device_fn': None, '_save_summary_steps': 100, '_num_ps_replicas': 0}

Experiment started at 16:33:57
.......................................
INFO:tensorflow:Skipping training since max_steps has already saved.
.......................................
Experiment finished at 16:33:57

Experiment elapsed time: 0.008284 seconds


<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x121f3b890>

As expected, no training occurred, since max_steps had already been reached.

Now let's try incremental steps.

In [88]:
train_experiment(
    training_steps=1000, 
    clean_start=False,
    incremental=True,
    run_config=run_config
)


INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_global_id_in_cluster': 0, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': 19831006, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x11fb97150>, '_model_dir': 'models/census/dnn_classifier', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_service': None, '_device_fn': None, '_save_summary_steps': 100, '_num_ps_replicas': 0}

Experiment started at 16:34:05
.......................................
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from models/census/dnn_classifier/model.ckpt-1000
INFO:tensorflow:Running local_i

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x1088a2650>

As shown, training resumes from step 1000 (the end of the previous run), so the total number of training steps reaches 2000.

## Train: Steps vs Epochs

While steps refer to how many **data batches** are used for training, epochs refer to how many times the **whole training data** is used for training.

Using epochs to define the number of training iterations is a conventional practice in machine learning. However, when working with very large datasets to train deep learning models, batch-level training steps (rather than whole-training-data-level epochs) are more practical.

In [106]:
num_epochs=3

train_input_fn = make_input_fn(
    TRAIN_DATA_FILE,
    batch_size=BATCH_SIZE,
    num_epochs=num_epochs,
    shuffle=True,
)

In [109]:
expected_training_steps = math.ceil(TRAIN_DATA_SIZE/float(BATCH_SIZE))*num_epochs

print('Training data size: {}'.format(TRAIN_DATA_SIZE))
print('Batch size: {}'.format(BATCH_SIZE) )
print('Number of epochs (supplied): {}'.format(num_epochs))
print('Number of training steps (expected): {}'.format(expected_training_steps))
print('')

train_experiment(
    training_steps=None, 
    clean_start=True,
    incremental=True,
    run_config=run_config
)

Training data size: 32561
Batch size: 200
Number of epochs (supplied): 3
Number of training steps (expected): 489.0

Removing previous artefacts...

INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_global_id_in_cluster': 0, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': 19831006, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x11fb97150>, '_model_dir': 'models/census/dnn_classifier', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_service': None, '_device_fn': None, '_save_summary_steps': 100, '_num_ps_replicas': 0}

Experiment started at 16:45:32
.......................................
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:te

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x11f699f10>

As expected, training with 3 epochs (for training data of 32561 records and a batch size of 200 records) takes 489 steps, which corresponds to **ceiling(TRAIN_DATA_SIZE / BATCH_SIZE) * num_epochs**.

Note that if both num_epochs (in the train_input_fn) and steps (in estimator.train) are supplied, training stops at whichever criterion is met first.
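
The stopping rule in the note above can be sketched as the minimum of the two limits. This is a plain-Python estimate under the same assumption as the formula above (each epoch contributes ceiling(data size / batch size) steps). The function `effective_training_steps` is a hypothetical helper, not a TensorFlow API.

```python
import math


def effective_training_steps(data_size, batch_size, num_epochs=None, steps=None):
    """Estimate how many steps actually run given an epoch limit and a step limit."""
    if num_epochs is None and steps is None:
        raise ValueError("Training would never stop: set num_epochs or steps.")
    if num_epochs is None:
        # The input repeats forever, so only the step limit applies.
        return steps
    steps_from_epochs = int(math.ceil(data_size / float(batch_size))) * num_epochs
    if steps is None:
        # Training ends when the input is exhausted.
        return steps_from_epochs
    # Both limits are set: the earlier one wins.
    return min(steps, steps_from_epochs)


print(effective_training_steps(32561, 200, num_epochs=3))                # 489
print(effective_training_steps(32561, 200, num_epochs=1000, steps=10))   # 10
print(effective_training_steps(32561, 200, num_epochs=1, steps=1000))    # 163
```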

In [110]:
train_input_fn = make_input_fn(
    TRAIN_DATA_FILE,
    batch_size=BATCH_SIZE,
    num_epochs=1000,
    shuffle=True,
)

train_experiment(
    training_steps=10, # the model will train for only 10 steps, ignoring the 1000 epochs
    clean_start=True,
    incremental=True,
    run_config=run_config
)

Removing previous artefacts...

INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_global_id_in_cluster': 0, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': 19831006, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x11fb97150>, '_model_dir': 'models/census/dnn_classifier', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_service': None, '_device_fn': None, '_save_summary_steps': 100, '_num_ps_replicas': 0}

Experiment started at 16:47:35
.......................................
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x11453d250>

In [111]:
train_input_fn = make_input_fn(
    TRAIN_DATA_FILE,
    batch_size=BATCH_SIZE,
    num_epochs=1, # the model will train for only 1 epoch (163 steps), ignoring the 1000 steps
    shuffle=True,
)

train_experiment(
    training_steps=1000, 
    clean_start=True,
    incremental=True,
    run_config=run_config
)

Removing previous artefacts...

INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_global_id_in_cluster': 0, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': 19831006, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x11fb97150>, '_model_dir': 'models/census/dnn_classifier', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_service': None, '_device_fn': None, '_save_summary_steps': 100, '_num_ps_replicas': 0}

Experiment started at 16:48:20
.......................................
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x11e135110>

## Train: Checkpoints

By default, a checkpoint is saved every 600 seconds (10 minutes). This behaviour is configured in the run_config passed to the estimator, using only one of the following parameters:

* **save_checkpoints_secs**: Save checkpoints every this many seconds. 
* **save_checkpoints_steps**: Save checkpoints every this many steps.

In addition, you can specify the number of checkpoints to keep using **keep_checkpoint_max**, which defaults to 5 (that is, the 5 most recent checkpoint files are kept).


The following code trains the model for 1000 steps...

In [114]:
os.environ['MODEL_DIR'] = model_dir

In [112]:
train_input_fn = make_input_fn(
    TRAIN_DATA_FILE,
    batch_size=BATCH_SIZE,
    num_epochs=None,
    shuffle=True,
)

train_experiment(
    training_steps=1000, 
    clean_start=True,
    incremental=True,
    run_config=run_config # using the default checkpoints param values
)

Removing previous artefacts...

INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_global_id_in_cluster': 0, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': 19831006, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x11fb97150>, '_model_dir': 'models/census/dnn_classifier', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_master': '', '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_service': None, '_device_fn': None, '_save_summary_steps': 100, '_num_ps_replicas': 0}

Experiment started at 16:49:03
.......................................
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x11ef36390>

In [115]:
%%bash

ls ${MODEL_DIR}

checkpoint
graph.pbtxt
model.ckpt-0.data-00000-of-00001
model.ckpt-0.index
model.ckpt-0.meta
model.ckpt-1000.data-00000-of-00001
model.ckpt-1000.index
model.ckpt-1000.meta


As shown, since the training (1000 iterations) finished in less than 600 seconds (the default value of **save_checkpoints_secs**), only 2 checkpoints were saved: the initial one and the final one.


Now let's set **save_checkpoints_steps** in the run_config to 200, so that 1000 steps produce 5 checkpoints.

In [116]:
run_config = tf.estimator.RunConfig(
    tf_random_seed=SEED,
    model_dir=model_dir,
    save_checkpoints_steps=200,  # so in 1000 steps, you produce 5 checkpoints
    save_checkpoints_secs=None
)


In [117]:
estimator=train_experiment(
    training_steps=1000, 
    clean_start=True,
    incremental=True,
    run_config=run_config 
)

Removing previous artefacts...

INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_global_id_in_cluster': 0, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': 198301006, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x11f8d4f90>, '_model_dir': 'models/census/dnn_classifier', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_master': '', '_save_checkpoints_steps': 200, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_service': None, '_device_fn': None, '_save_summary_steps': 100, '_num_ps_replicas': 0}

Experiment started at 16:51:27
.......................................
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INF

In [118]:
%%bash

ls ${MODEL_DIR}

checkpoint
graph.pbtxt
model.ckpt-1000.data-00000-of-00001
model.ckpt-1000.index
model.ckpt-1000.meta
model.ckpt-200.data-00000-of-00001
model.ckpt-200.index
model.ckpt-200.meta
model.ckpt-400.data-00000-of-00001
model.ckpt-400.index
model.ckpt-400.meta
model.ckpt-600.data-00000-of-00001
model.ckpt-600.index
model.ckpt-600.meta
model.ckpt-800.data-00000-of-00001
model.ckpt-800.index
model.ckpt-800.meta


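Each checkpoint is labelled with the step number at which it was saved. The initial checkpoint (step 0) is no longer listed because **keep_checkpoint_max** defaults to 5, so only the 5 most recent checkpoints are kept.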
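
The two directory listings in this section are consistent with a simple retention rule: a checkpoint at step 0, one every **save_checkpoints_steps** steps, one at the end of training, and only the most recent **keep_checkpoint_max** of them kept. The sketch below models that rule in plain Python. The function `retained_checkpoints` is a hypothetical helper, and the rule is inferred from the listings rather than taken from the TensorFlow source.

```python
def retained_checkpoints(total_steps, save_every_steps, keep_max=5):
    """Return the step numbers of the checkpoints left on disk after training."""
    saved = [0]  # initial checkpoint
    saved.extend(range(save_every_steps, total_steps + 1, save_every_steps))
    if saved[-1] != total_steps:
        saved.append(total_steps)  # final checkpoint
    # Older checkpoints beyond keep_max are deleted.
    return saved[-keep_max:]


print(retained_checkpoints(1000, 200, keep_max=5))  # [200, 400, 600, 800, 1000]
print(retained_checkpoints(1000, 200, keep_max=3))  # [600, 800, 1000]
```

A save interval longer than the whole run, for example `retained_checkpoints(1000, 5000)`, leaves only the initial and final checkpoints, which mirrors the first listing where the 600-second timer never fired.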

## Evaluate: Epochs vs Steps
* **batch_size** is set (it can be bigger than the batch size used for training, if the batch fits in memory).
* **num_epochs** is usually set to 1 (as you want to evaluate your model on the entire evaluation data once).

In [122]:
eval_input_fn = make_input_fn(
    EVAL_DATA_FILE,
    batch_size=BATCH_SIZE,
    num_epochs=1,
    shuffle=False,
)

For evaluation, if you set epochs to 1, you can ignore the steps param (set it to None).


By default, the latest checkpoint is evaluated.

In [123]:
estimator.evaluate(eval_input_fn, steps=None)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-01-15-16:52:48
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from models/census/dnn_classifier/model.ckpt-1000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-01-15-16:52:50
INFO:tensorflow:Saving dict for global step 1000: accuracy = 0.7857231, accuracy_baseline = 0.7637916, auc = 0.8588228, auc_precision_recall = 0.64802057, average_loss = 0.47923627, global_step = 1000, label/mean = 0.23620838, loss = 95.13425, precision = 0.8635438, prediction/mean = 0.13552067, recall = 0.110273086
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 1000: models/census/dnn_classifier/model.ckpt-1000


{'accuracy': 0.7857231,
 'accuracy_baseline': 0.7637916,
 'auc': 0.8588228,
 'auc_precision_recall': 0.64802057,
 'average_loss': 0.47923627,
 'global_step': 1000,
 'label/mean': 0.23620838,
 'loss': 95.13425,
 'precision': 0.8635438,
 'prediction/mean': 0.13552067,
 'recall': 0.110273086}

This is essentially equivalent to setting num_epochs to None and setting **steps** to the number of batches in the evaluation data (82 for a batch size of 200).
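
Since the latest checkpoint is evaluated by default, evaluating an earlier one requires passing the `checkpoint_path` argument of `Estimator.evaluate`. The following is a minimal sketch. The helper name `evaluate_checkpoint` is hypothetical, and the `model.ckpt-<step>` prefix is assumed from the file names listed in the model directory earlier.

```python
import os


def evaluate_checkpoint(estimator, input_fn, model_dir, step):
    """Evaluate the checkpoint saved at `step` instead of the latest one."""
    # Checkpoint prefix, without the .index / .meta / .data-* suffixes.
    checkpoint_path = os.path.join(model_dir, "model.ckpt-{}".format(step))
    return estimator.evaluate(input_fn, steps=None, checkpoint_path=checkpoint_path)
```

With the objects defined in this tutorial, a call such as `evaluate_checkpoint(estimator, eval_input_fn, model_dir, 800)` would evaluate the step-800 checkpoint, provided it has not already been removed by **keep_checkpoint_max**.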

## Export: Serving Input Receiver Function

In [125]:
def make_serving_input_receiver_fn():
    
    inputs = {}
    for feature_name in FEATURE_NAMES:
        dtype = tf.float32 if feature_name in NUMERIC_FEATURE_NAMES else tf.string
        inputs[feature_name] = tf.placeholder(shape=[None], dtype=dtype)
        
    return tf.estimator.export.build_raw_serving_input_receiver_fn(inputs)

export_dir = os.path.join(model_dir, 'export')

if tf.gfile.Exists(export_dir):
    tf.gfile.DeleteRecursively(export_dir)
        
estimator.export_savedmodel(
    export_dir_base=export_dir,
    serving_input_receiver_fn=make_serving_input_receiver_fn()
)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Signatures INCLUDED in export for Eval: None
INFO:tensorflow:Signatures INCLUDED in export for Classify: None
INFO:tensorflow:Signatures INCLUDED in export for Regress: None
INFO:tensorflow:Signatures INCLUDED in export for Predict: ['predict']
INFO:tensorflow:Signatures INCLUDED in export for Train: None
INFO:tensorflow:Signatures EXCLUDED from export because they cannot be be served via TensorFlow Serving APIs:
INFO:tensorflow:'serving_default' : Classification input must be a single string Tensor; got {'hours_per_week': <tf.Tensor 'Placeholder_4:0' shape=(?,) dtype=float32>, 'workclass': <tf.Tensor 'Placeholder_10:0' shape=(?,) dtype=string>, 'relationship': <tf.Tensor 'Placeholder_9:0' shape=(?,) dtype=string>, 'gender': <tf.Tensor 'Placeholder_5:0' shape=(?,) dtype=string>, 'age': <tf.Tensor 'Placeholder:0' shape=(?,) dtype=float32>, 'marital_status': <tf.Tensor 'Placeholder_8:0' shape=(?,) dt

'models/census/dnn_classifier/export/1547571440'
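
The exported model expects one float tensor per numeric feature and one string tensor per categorical feature. The sketch below shows one way to shape a raw record into such an instance. It assumes that categorical strings should match the training vocabulary exactly, which in this dataset means a single leading space (as in the vocabulary printed earlier). The helper `make_instance` and the sample values are illustrative only.

```python
NUMERIC = ['age', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']
CATEGORICAL = ['gender', 'race', 'education', 'marital_status', 'relationship',
               'workclass', 'occupation', 'native_country']


def make_instance(raw):
    """Cast a raw record to the dtypes and string form used by the serving inputs."""
    instance = {}
    for name in NUMERIC:
        instance[name] = float(raw[name])  # placeholders are tf.float32
    for name in CATEGORICAL:
        # Assumption: vocabulary entries start with exactly one space.
        instance[name] = ' ' + str(raw[name]).strip()
    return instance


# Illustrative record; the values are examples, not tutorial output.
raw_record = {
    'age': 39, 'education_num': 13, 'capital_gain': 2174, 'capital_loss': 0,
    'hours_per_week': 40, 'gender': 'Male', 'race': 'White',
    'education': 'Bachelors', 'marital_status': 'Never-married',
    'relationship': 'Not-in-family', 'workclass': 'State-gov',
    'occupation': 'Adm-clerical', 'native_country': 'United-States',
}
print(make_instance(raw_record))
```

The weight column (`fnlwgt`) and the label are not part of the serving inputs, so they are left out of the instance.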

In [126]:
%%bash

saved_models_base=${MODEL_DIR}/export/
saved_model_dir=${saved_models_base}$(ls ${saved_models_base} | tail -n 1)
echo ${saved_model_dir}
ls ${saved_model_dir}
saved_model_cli show --dir=${saved_model_dir} --all

models/census/dnn_classifier/export/1547571440
saved_model.pb
variables

MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['predict']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['age'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder:0
    inputs['capital_gain'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_2:0
    inputs['capital_loss'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_3:0
    inputs['education'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: Placeholder_7:0
    inputs['education_num'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_1:0
    inputs['gender'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: Placeholder_5:0
    inputs['hours_per_week'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
     

## Intertwining Training & Evaluation
* Use TrainSpec & EvalSpec with tf.estimator.train_and_evaluate().
* In TrainSpec:
    * Set **num_epochs** in the **train_input_fn** to None, so that the input repeats indefinitely.
    * You need to set the **max_steps** param; otherwise training will run forever.
* In EvalSpec (to evaluate the model using the whole evaluation data once):
    * **num_epochs** is set to 1 in the **eval_input_fn**.
    * The **steps** param is set to None.
* **Evaluation** occurs when a **new checkpoint** is saved.
* Checkpoint saving frequency is configured in run_config (using save_checkpoints_steps or save_checkpoints_secs).
* You can set the minimum amount of time between two evaluations using **throttle_secs** in EvalSpec. For example, if **throttle_secs** is set to 60, the next evaluation will only occur at least 60 seconds after the previous one, even if **save_checkpoints_secs** is set to 10.
* If **throttle_secs** is set to 0, evaluation will occur each time a checkpoint is saved, regardless of the time between two consecutive checkpoints.
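
The interaction between checkpoint frequency and **throttle_secs** can be illustrated with a small simulation. This is a simplified model of the rule described in the bullets above, not the actual TensorFlow scheduler. It does not model details such as how the final checkpoint is handled, and `evaluated_checkpoints` is a hypothetical name.

```python
def evaluated_checkpoints(checkpoint_times, throttle_secs):
    """Return the checkpoint times (in seconds) at which an evaluation is triggered."""
    evaluated = []
    last_eval = None
    for t in checkpoint_times:
        # Evaluate the first checkpoint, then only if enough time has passed.
        if last_eval is None or t - last_eval >= throttle_secs:
            evaluated.append(t)
            last_eval = t
    return evaluated


# A checkpoint every 10 seconds for 2 minutes (save_checkpoints_secs=10).
checkpoint_times = list(range(10, 130, 10))
print(evaluated_checkpoints(checkpoint_times, throttle_secs=60))  # [10, 70]
print(evaluated_checkpoints(checkpoint_times, throttle_secs=0))   # every checkpoint
```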

In [134]:
def train_and_evaluate_experiment(params, run_config):
    
    # TrainSpec ####################################
    train_input_fn = make_input_fn(
        TRAIN_DATA_FILE,
        batch_size=BATCH_SIZE,
        num_epochs=None,
        shuffle=True
    )
    
    train_spec = tf.estimator.TrainSpec(
        input_fn = train_input_fn,
        max_steps=params.traning_steps
    )
    ###############################################
    
    
    # EvalSpec ####################################
    eval_input_fn = make_input_fn(
        EVAL_DATA_FILE,
        batch_size=BATCH_SIZE,
        num_epochs=1,
        shuffle=False
    )

    eval_spec = tf.estimator.EvalSpec(
        name=datetime.utcnow().strftime("%H%M%S"),
        input_fn = eval_input_fn,
        steps=None,
        start_delay_secs=0,
        throttle_secs=params.eval_throttle_secs
    )
    
    ###############################################

    tf.logging.set_verbosity(tf.logging.INFO)
    
    if params.clean_start:
        if tf.gfile.Exists(run_config.model_dir):
            print("Removing previous artefacts...")
            tf.gfile.DeleteRecursively(run_config.model_dir)
            

    print("")
    estimator = create_estimator(run_config)
    print("")
    
    time_start = datetime.utcnow() 
    print("Experiment started at {}".format(time_start.strftime("%H:%M:%S")))
    print(".......................................") 

    tf.estimator.train_and_evaluate(
        estimator=estimator,
        train_spec=train_spec, 
        eval_spec=eval_spec
    )

    time_end = datetime.utcnow() 
    print(".......................................")
    print("Experiment finished at {}".format(time_end.strftime("%H:%M:%S")))
    print("")
    time_elapsed = time_end - time_start
    print("Experiment elapsed time: {} seconds".format(time_elapsed.total_seconds()))
    
    return estimator


Now let's try the following:
* Training for 1000 steps (set **num_epochs** to None and **max_steps** to 1000).
* Save a checkpoint after each 200 steps (set **save_checkpoints_steps** to 200).
* Evaluate when each checkpoint is produced (set **eval_throttle_secs** to 0). That is, 5 evaluations in total.
* Keep only the latest 3 checkpoints out of the 5 checkpoints to be saved (set **keep_checkpoint_max** to 3)
* When evaluating, use the whole eval_data once (set **num_epochs** to 1 and **steps** to None).


In [135]:
params  = tf.contrib.training.HParams(
    batch_size=BATCH_SIZE,
    traning_steps=1000,
    eval_throttle_secs=0,
    clean_start=True,
)

run_config = tf.estimator.RunConfig(
    tf_random_seed=SEED,
    save_checkpoints_steps=200,
    keep_checkpoint_max=3,
    model_dir=model_dir
)

In [136]:
train_and_evaluate_experiment(params, run_config)

Removing previous artefacts...

INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_global_id_in_cluster': 0, '_session_config': None, '_keep_checkpoint_max': 3, '_tf_random_seed': 19830610, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x11e3c9b90>, '_model_dir': 'models/census/dnn_classifier', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_master': '', '_save_checkpoints_steps': 200, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_service': None, '_device_fn': None, '_save_summary_steps': 100, '_num_ps_replicas': 0}

Experiment started at 17:37:36
.......................................
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x12180ff90>

In [137]:
%%bash

ls ${MODEL_DIR}

checkpoint
eval_173736
graph.pbtxt
model.ckpt-1000.data-00000-of-00001
model.ckpt-1000.index
model.ckpt-1000.meta
model.ckpt-600.data-00000-of-00001
model.ckpt-600.index
model.ckpt-600.meta
model.ckpt-800.data-00000-of-00001
model.ckpt-800.index
model.ckpt-800.meta


To train the model for a given number of epochs (**num_epochs**), you need to do the following:
* Know the training data size before training.
* Compute the training steps as: **ceil(TRAIN_DATA_SIZE / BATCH_SIZE) * num_epochs**.
* In TrainSpec, set **max_steps** to the computed value.
* Set **num_epochs** in the train_input_fn to None.
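
The conversion between epochs and steps can be sketched in plain Python. This is only an illustration: the helper names `steps_for_epochs` and `epochs_for_steps` are hypothetical, and the data size (32561) and batch size (200) are the values printed by this tutorial's run.

```python
import math


def steps_for_epochs(train_size, batch_size, num_epochs):
    """Steps needed so that the training data is covered num_epochs times."""
    steps_per_epoch = int(math.ceil(train_size / float(batch_size)))
    return steps_per_epoch * num_epochs


def epochs_for_steps(train_size, batch_size, steps):
    """Approximate number of passes over the data made by `steps` steps."""
    return steps * batch_size / float(train_size)


print(steps_for_epochs(32561, 200, 5))               # 815
print(round(epochs_for_steps(32561, 200, 1000), 2))  # 6.14
```

The second function goes the other way: it shows that the earlier 1000-step run corresponds to roughly six passes over the training data.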

In [142]:
import math

num_epochs = 5
computed_traning_steps = int(math.ceil((TRAIN_DATA_SIZE/float(BATCH_SIZE)))*num_epochs)

print('Training data size: {}'.format(TRAIN_DATA_SIZE))
print('Batch size: {}'.format(BATCH_SIZE))
print('Number of epochs (supplied): {}'.format(num_epochs))
print('Number of training steps (computed): {}'.format(computed_traning_steps))
print('')


params = tf.contrib.training.HParams(
    batch_size=BATCH_SIZE,
    traning_steps=computed_traning_steps,
    eval_throttle_secs=0,
    clean_start=True,
)

run_config = tf.estimator.RunConfig(
    tf_random_seed=SEED,
    save_checkpoints_steps=200,
    model_dir=model_dir
)

train_and_evaluate_experiment(params, run_config)

Training data size: 32561
Batch size: 200
Number of epochs (supplied): 5
Number of training steps (computed): 815

Removing previous artefacts...

INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_global_id_in_cluster': 0, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': 19830610, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x11a8771d0>, '_model_dir': 'models/census/dnn_classifier', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_master': '', '_save_checkpoints_steps': 200, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_service': None, '_device_fn': None, '_save_summary_steps': 100, '_num_ps_replicas': 0}

Experiment started at 17:42:28
.......................................
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. Th

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x11a3b6b90>

Note that, since we set **save_checkpoints_steps** to 200 and train for 815 steps, we get 5 checkpoints (and corresponding evaluations):

* at step 200
* at step 400
* at step 600
* at step 800
* at the end (step 815)

In [143]:
%%bash

ls ${MODEL_DIR}

checkpoint
eval_174228
graph.pbtxt
model.ckpt-200.data-00000-of-00001
model.ckpt-200.index
model.ckpt-200.meta
model.ckpt-400.data-00000-of-00001
model.ckpt-400.index
model.ckpt-400.meta
model.ckpt-600.data-00000-of-00001
model.ckpt-600.index
model.ckpt-600.meta
model.ckpt-800.data-00000-of-00001
model.ckpt-800.index
model.ckpt-800.meta
model.ckpt-815.data-00000-of-00001
model.ckpt-815.index
model.ckpt-815.meta


Note that the checkpoint saved at the beginning of training is removed, since **keep_checkpoint_max** is set to 5 (the default value) and five later checkpoints were saved.
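
The checkpoint listings of the two runs above can be reproduced with a small plain-Python sketch. It assumes that a checkpoint is written at the start of training (step 0), then every **save_checkpoints_steps** steps, and once more at the final step, and that only the newest **keep_checkpoint_max** checkpoints are kept. The helper name `retained_checkpoint_steps` is hypothetical and not part of TensorFlow.

```python
def retained_checkpoint_steps(max_steps, save_every, keep_max):
    """Return the steps whose checkpoints remain on disk after training.

    Assumption: checkpoints are written at step 0, every `save_every`
    steps, and at the final step; only the newest `keep_max` are kept.
    """
    saved = list(range(0, max_steps + 1, save_every))
    if saved[-1] != max_steps:
        # The final step does not fall on a save boundary.
        saved.append(max_steps)
    return saved[-keep_max:]


print(retained_checkpoint_steps(1000, 200, 3))  # [600, 800, 1000]
print(retained_checkpoint_steps(815, 200, 5))   # [200, 400, 600, 800, 815]
```

Both results match the `ls` output shown for the corresponding runs.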

## Train, Evaluate, and Export

In [147]:
def train_evaluate_export_experiment(params, run_config, exporter):
    
    # TrainSpec ####################################
    train_input_fn = make_input_fn(
        TRAIN_DATA_FILE,
        batch_size=BATCH_SIZE,
        num_epochs=None,
        shuffle=True
    )
    
    train_spec = tf.estimator.TrainSpec(
        input_fn = train_input_fn,
        max_steps=params.traning_steps
    )
    ###############################################
    
    
    # EvalSpec ####################################
    eval_input_fn = make_input_fn(
        EVAL_DATA_FILE,
        batch_size=BATCH_SIZE,
        num_epochs=1,
        shuffle=False
    )
    
    eval_spec = tf.estimator.EvalSpec(
        name=params.eval_name,
        input_fn=eval_input_fn,
        exporters=[exporter],
        steps=None,
        start_delay_secs=0,
        throttle_secs=params.eval_throttle_secs
    )
    ###############################################

    tf.logging.set_verbosity(tf.logging.INFO)
    
    if params.clean_start:
        if tf.gfile.Exists(run_config.model_dir):
            print("Removing previous artefacts...")
            tf.gfile.DeleteRecursively(run_config.model_dir)
            

    print("")
    estimator = create_estimator(run_config)
    print("")
    
    time_start = datetime.utcnow() 
    print("Experiment started at {}".format(time_start.strftime("%H:%M:%S")))
    print(".......................................") 

    tf.estimator.train_and_evaluate(
        estimator=estimator,
        train_spec=train_spec, 
        eval_spec=eval_spec
    )

    time_end = datetime.utcnow() 
    print(".......................................")
    print("Experiment finished at {}".format(time_end.strftime("%H:%M:%S")))
    print("")
    time_elapsed = time_end - time_start
    print("Experiment elapsed time: {} seconds".format(time_elapsed.total_seconds()))
    
    return estimator


### **Latest exporter**
This exports a model after each evaluation.

You can specify the maximum number of exported models to keep using the **exports_to_keep** parameter.
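
The effect of **exports_to_keep** can be pictured with a short plain-Python sketch. It assumes that each export is written to a folder named after its timestamp and that older folders are removed once the limit is exceeded. The helper `kept_exports` is hypothetical, and the numbers 1 to 5 are placeholders for timestamps, not values from an actual run.

```python
def kept_exports(export_timestamps, exports_to_keep):
    """Return the export folders that remain after pruning, oldest first.

    Assumption: folders are named by timestamp, so sorting the names
    orders them from oldest to newest. None means no pruning.
    """
    ordered = sorted(export_timestamps)
    if exports_to_keep is None:
        return ordered
    return ordered[-exports_to_keep:]


# Five evaluations produce five exports; only the newest three remain.
print(kept_exports([1, 2, 3, 4, 5], exports_to_keep=3))  # [3, 4, 5]
```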

In [148]:
exporter = tf.estimator.LatestExporter(
            name="estimate", 
            serving_input_receiver_fn=make_serving_input_receiver_fn(),
            exports_to_keep=3,
)

params = tf.contrib.training.HParams(
    batch_size=BATCH_SIZE,
    traning_steps=computed_traning_steps,
    eval_throttle_secs=0,
    clean_start=True,
    eval_name=datetime.utcnow().strftime("%H%M%S")
)

run_config = tf.estimator.RunConfig(
    tf_random_seed=SEED,
    save_checkpoints_steps=200,
    model_dir=model_dir
)

train_evaluate_export_experiment(params, run_config, exporter)

Removing previous artefacts...

INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_global_id_in_cluster': 0, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': 19830610, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x11e2a05d0>, '_model_dir': 'models/census/dnn_classifier', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_master': '', '_save_checkpoints_steps': 250, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_service': None, '_device_fn': None, '_save_summary_steps': 100, '_num_ps_replicas': 0}

Experiment started at 17:48:58
.......................................
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_

INFO:tensorflow:Signatures INCLUDED in export for Regress: None
INFO:tensorflow:Signatures INCLUDED in export for Predict: ['predict']
INFO:tensorflow:Signatures INCLUDED in export for Train: None
INFO:tensorflow:Signatures EXCLUDED from export because they cannot be be served via TensorFlow Serving APIs:
INFO:tensorflow:'serving_default' : Classification input must be a single string Tensor; got {'hours_per_week': <tf.Tensor 'Placeholder_56:0' shape=(?,) dtype=float32>, 'workclass': <tf.Tensor 'Placeholder_62:0' shape=(?,) dtype=string>, 'relationship': <tf.Tensor 'Placeholder_61:0' shape=(?,) dtype=string>, 'gender': <tf.Tensor 'Placeholder_57:0' shape=(?,) dtype=string>, 'age': <tf.Tensor 'Placeholder_52:0' shape=(?,) dtype=float32>, 'marital_status': <tf.Tensor 'Placeholder_60:0' shape=(?,) dtype=string>, 'race': <tf.Tensor 'Placeholder_58:0' shape=(?,) dtype=string>, 'capital_gain': <tf.Tensor 'Placeholder_54:0' shape=(?,) dtype=float32>, 'native_country': <tf.Tensor 'Placeholder_

INFO:tensorflow:Restoring parameters from models/census/dnn_classifier/model.ckpt-750
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: models/census/dnn_classifier/export/estimate/temp-1547574570/saved_model.pb
INFO:tensorflow:global_step/sec: 12.9171
INFO:tensorflow:loss = 73.53612, step = 801 (7.743 sec)
INFO:tensorflow:Saving checkpoints for 815 into models/census/dnn_classifier/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-01-15-17:49:35
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from models/census/dnn_classifier/model.ckpt-815
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-01-15-17:49:37
INFO:tensorflow:Saving dict for global step 815: accuracy = 0.79911536, accuracy_baseline = 0.7637916, auc = 0.88614124, auc_precision_recall

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x121920650>

In [149]:
%%bash

saved_models_base=${MODEL_DIR}/export/estimate/
echo 'exported model folders:'
ls ${saved_models_base}
echo ''

saved_model_dir=${saved_models_base}$(ls ${saved_models_base} | tail -n 1)
echo 'last exported model: '${saved_model_dir}
ls ${saved_model_dir}
saved_model_cli show --dir=${saved_model_dir} --all

exported model folders:
1547574561
1547574570
1547574577

last exported model: models/census/dnn_classifier/export/estimate/1547574577
saved_model.pb
variables

MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['predict']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['age'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_52:0
    inputs['capital_gain'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_54:0
    inputs['capital_loss'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_55:0
    inputs['education'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: Placeholder_59:0
    inputs['education_num'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_53:0
    inputs['gender'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: Placeholder_

### **Final exporter**
This exports only the very last evaluated checkpoint of the model.

In [150]:
exporter = tf.estimator.FinalExporter(
            name="estimate",
            serving_input_receiver_fn=make_serving_input_receiver_fn()
)

params = tf.contrib.training.HParams(
    batch_size=BATCH_SIZE,
    traning_steps=computed_traning_steps,
    eval_throttle_secs=0,
    clean_start=True,
    eval_name=datetime.utcnow().strftime("%H%M%S")
)

run_config = tf.estimator.RunConfig(
    tf_random_seed=SEED,
    save_checkpoints_steps=200,
    model_dir=model_dir
)

train_evaluate_export_experiment(params, run_config, exporter)

Removing previous artefacts...

INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_global_id_in_cluster': 0, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': 19830610, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x11a266d90>, '_model_dir': 'models/census/dnn_classifier', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_master': '', '_save_checkpoints_steps': 250, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_service': None, '_device_fn': None, '_save_summary_steps': 100, '_num_ps_replicas': 0}

Experiment started at 17:50:16
.......................................
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_

INFO:tensorflow:'regression' : Regression input must be a single string Tensor; got {'hours_per_week': <tf.Tensor 'Placeholder_69:0' shape=(?,) dtype=float32>, 'workclass': <tf.Tensor 'Placeholder_75:0' shape=(?,) dtype=string>, 'relationship': <tf.Tensor 'Placeholder_74:0' shape=(?,) dtype=string>, 'gender': <tf.Tensor 'Placeholder_70:0' shape=(?,) dtype=string>, 'age': <tf.Tensor 'Placeholder_65:0' shape=(?,) dtype=float32>, 'marital_status': <tf.Tensor 'Placeholder_73:0' shape=(?,) dtype=string>, 'race': <tf.Tensor 'Placeholder_71:0' shape=(?,) dtype=string>, 'capital_gain': <tf.Tensor 'Placeholder_67:0' shape=(?,) dtype=float32>, 'native_country': <tf.Tensor 'Placeholder_77:0' shape=(?,) dtype=string>, 'capital_loss': <tf.Tensor 'Placeholder_68:0' shape=(?,) dtype=float32>, 'education': <tf.Tensor 'Placeholder_72:0' shape=(?,) dtype=string>, 'education_num': <tf.Tensor 'Placeholder_66:0' shape=(?,) dtype=float32>, 'occupation': <tf.Tensor 'Placeholder_76:0' shape=(?,) dtype=string>

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x11a266390>

In [151]:
%%bash

saved_models_base=${MODEL_DIR}/export/estimate/
echo 'exported model folders:'
ls ${saved_models_base}
echo ''

saved_model_dir=${saved_models_base}$(ls ${saved_models_base} | tail -n 1)
echo 'last exported model: '${saved_model_dir}
ls ${saved_model_dir}
saved_model_cli show --dir=${saved_model_dir} --all

exported model folders:
1547574655

last exported model: models/census/dnn_classifier/export/estimate/1547574655
saved_model.pb
variables

MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['predict']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['age'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_65:0
    inputs['capital_gain'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_67:0
    inputs['capital_loss'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_68:0
    inputs['education'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: Placeholder_72:0
    inputs['education_num'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_66:0
    inputs['gender'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: Placeholder_70:0
    inputs['hours

### **Best exporter**
This exports a model every time the newly evaluated model is better than any existing model.

It uses the evaluation events stored under the **eval** folder.

You need to set the **name** of the subfolder in the EvalSpec, and set the **event_file_pattern** in the BestExporter to point to this folder so that it can perform the evaluation comparisons.
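
What "better" means is decided by a comparison function. BestExporter accepts a **compare_fn** argument that receives the best evaluation result so far and the current one, and returns True when the current one is better. To my understanding, the default compares the **loss** values. The following is a TensorFlow-free sketch of that logic, replaying three of the evaluation results logged by the run in this section (steps 3500, 5000, and 7500). The function names and the helper `steps_that_would_export` are hypothetical and only serve the illustration.

```python
def loss_smaller(best_eval_result, current_eval_result):
    """Return True when the current evaluation has a lower loss."""
    return current_eval_result["loss"] < best_eval_result["loss"]


def accuracy_higher(best_eval_result, current_eval_result):
    """Alternative criterion: return True when accuracy improves."""
    return current_eval_result["accuracy"] > best_eval_result["accuracy"]


def steps_that_would_export(eval_results, compare_fn):
    """Replay evaluation results and collect the steps that beat the best."""
    best = None
    exported = []
    for result in eval_results:
        # The first result is always the best seen so far.
        if best is None or compare_fn(best, result):
            best = result
            exported.append(result["global_step"])
    return exported


# Values copied from the evaluation log lines of this run.
logged_results = [
    {"global_step": 3500, "loss": 73.89528, "accuracy": 0.80267847},
    {"global_step": 5000, "loss": 70.97834, "accuracy": 0.81963384},
    {"global_step": 7500, "loss": 67.73848, "accuracy": 0.8493058},
]

print(steps_that_would_export(logged_results, loss_smaller))
```

A function with the same two-argument shape as `accuracy_higher` could be passed as **compare_fn** if you prefer to select the best model by a metric other than the loss.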

In [157]:
eval_name=datetime.utcnow().strftime("%H%M%S")

exporter = tf.estimator.BestExporter(
    event_file_pattern='eval_{}/*.tfevents.*'.format(eval_name),
    name="estimate", 
    serving_input_receiver_fn=make_serving_input_receiver_fn(),
    exports_to_keep=1
)

params = tf.contrib.training.HParams(
    batch_size=BATCH_SIZE,
    traning_steps=10000,
    eval_throttle_secs=0,
    exporter_type='best',
    clean_start=True,
    eval_name=eval_name
)

run_config = tf.estimator.RunConfig(
    tf_random_seed=SEED,
    save_checkpoints_steps=500,
    model_dir=model_dir
)

train_evaluate_export_experiment(params, run_config, exporter)

Removing previous artefacts...

INFO:tensorflow:Using config: {'_save_checkpoints_secs': None, '_global_id_in_cluster': 0, '_session_config': None, '_keep_checkpoint_max': 5, '_tf_random_seed': 19831060, '_task_type': 'worker', '_train_distribute': None, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x124afb850>, '_model_dir': 'models/census/dnn_classifier', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_master': '', '_save_checkpoints_steps': 500, '_keep_checkpoint_every_n_hours': 10000, '_evaluation_master': '', '_service': None, '_device_fn': None, '_save_summary_steps': 100, '_num_ps_replicas': 0}

Experiment started at 17:56:27
.......................................
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_

INFO:tensorflow:global_step/sec: 111.021
INFO:tensorflow:loss = 66.10107, step = 2201 (0.901 sec)
INFO:tensorflow:global_step/sec: 114.271
INFO:tensorflow:loss = 77.2023, step = 2301 (0.876 sec)
INFO:tensorflow:global_step/sec: 111.396
INFO:tensorflow:loss = 65.51532, step = 2401 (0.897 sec)
INFO:tensorflow:Saving checkpoints for 2500 into models/census/dnn_classifier/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-01-15-17:57:19
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from models/census/dnn_classifier/model.ckpt-2500
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-01-15-17:57:22
INFO:tensorflow:Saving dict for global step 2500: accuracy = 0.7984396, accuracy_baseline = 0.7637916, auc = 0.8953998, auc_precision_recall = 0.7283441, average_loss = 0.37381795, global_step = 2500, label/mean = 0.236

INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-01-15-17:57:45
INFO:tensorflow:Saving dict for global step 3500: accuracy = 0.80267847, accuracy_baseline = 0.7637916, auc = 0.8970522, auc_precision_recall = 0.7322527, average_loss = 0.37224555, global_step = 3500, label/mean = 0.23620838, loss = 73.89528, precision = 0.9031847, prediction/mean = 0.16327314, recall = 0.18439531
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 3500: models/census/dnn_classifier/model.ckpt-3500
INFO:tensorflow:global_step/sec: 16.7232
INFO:tensorflow:loss = 61.537792, step = 3501 (5.980 sec)
INFO:tensorflow:global_step/sec: 90.3517
INFO:tensorflow:loss = 64.823944, step = 3601 (1.106 sec)
INFO:tensorflow:global_step/sec: 79.041
INFO:tensorflow:loss = 53.05521, step = 3701 (1.266 sec)
INFO:tensorflow:global_step/sec: 100.916
INFO:tensorflow:loss = 63.682304, step = 3801 (0.991 sec)
INFO:tensorflow:global_step/sec: 78.4977
INFO:tensorflow:loss = 62.415

INFO:tensorflow:global_step/sec: 103.779
INFO:tensorflow:loss = 54.095524, step = 4901 (0.963 sec)
INFO:tensorflow:Saving checkpoints for 5000 into models/census/dnn_classifier/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-01-15-17:58:15
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from models/census/dnn_classifier/model.ckpt-5000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-01-15-17:58:18
INFO:tensorflow:Saving dict for global step 5000: accuracy = 0.81963384, accuracy_baseline = 0.7637916, auc = 0.8945979, auc_precision_recall = 0.7114821, average_loss = 0.35755154, global_step = 5000, label/mean = 0.23620838, loss = 70.97834, precision = 0.7554806, prediction/mean = 0.2260002, recall = 0.34954485
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 5000: models/census/dnn_classifi

INFO:tensorflow:'regression' : Regression input must be a single string Tensor; got {'hours_per_week': <tf.Tensor 'Placeholder_121:0' shape=(?,) dtype=float32>, 'workclass': <tf.Tensor 'Placeholder_127:0' shape=(?,) dtype=string>, 'relationship': <tf.Tensor 'Placeholder_126:0' shape=(?,) dtype=string>, 'gender': <tf.Tensor 'Placeholder_122:0' shape=(?,) dtype=string>, 'age': <tf.Tensor 'Placeholder_117:0' shape=(?,) dtype=float32>, 'marital_status': <tf.Tensor 'Placeholder_125:0' shape=(?,) dtype=string>, 'race': <tf.Tensor 'Placeholder_123:0' shape=(?,) dtype=string>, 'capital_gain': <tf.Tensor 'Placeholder_119:0' shape=(?,) dtype=float32>, 'native_country': <tf.Tensor 'Placeholder_129:0' shape=(?,) dtype=string>, 'capital_loss': <tf.Tensor 'Placeholder_120:0' shape=(?,) dtype=float32>, 'education': <tf.Tensor 'Placeholder_124:0' shape=(?,) dtype=string>, 'education_num': <tf.Tensor 'Placeholder_118:0' shape=(?,) dtype=float32>, 'occupation': <tf.Tensor 'Placeholder_128:0' shape=(?,) 

INFO:tensorflow:'classification' : Classification input must be a single string Tensor; got {'hours_per_week': <tf.Tensor 'Placeholder_121:0' shape=(?,) dtype=float32>, 'workclass': <tf.Tensor 'Placeholder_127:0' shape=(?,) dtype=string>, 'relationship': <tf.Tensor 'Placeholder_126:0' shape=(?,) dtype=string>, 'gender': <tf.Tensor 'Placeholder_122:0' shape=(?,) dtype=string>, 'age': <tf.Tensor 'Placeholder_117:0' shape=(?,) dtype=float32>, 'marital_status': <tf.Tensor 'Placeholder_125:0' shape=(?,) dtype=string>, 'race': <tf.Tensor 'Placeholder_123:0' shape=(?,) dtype=string>, 'capital_gain': <tf.Tensor 'Placeholder_119:0' shape=(?,) dtype=float32>, 'native_country': <tf.Tensor 'Placeholder_129:0' shape=(?,) dtype=string>, 'capital_loss': <tf.Tensor 'Placeholder_120:0' shape=(?,) dtype=float32>, 'education': <tf.Tensor 'Placeholder_124:0' shape=(?,) dtype=string>, 'education_num': <tf.Tensor 'Placeholder_118:0' shape=(?,) dtype=float32>, 'occupation': <tf.Tensor 'Placeholder_128:0' sha

INFO:tensorflow:Saving dict for global step 7500: accuracy = 0.8493058, accuracy_baseline = 0.7637916, auc = 0.8981279, auc_precision_recall = 0.7397051, average_loss = 0.3412308, global_step = 7500, label/mean = 0.23620838, loss = 67.73848, precision = 0.7457627, prediction/mean = 0.22837831, recall = 0.54928476
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 7500: models/census/dnn_classifier/model.ckpt-7500
INFO:tensorflow:Performing best model export.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Signatures INCLUDED in export for Eval: None
INFO:tensorflow:Signatures INCLUDED in export for Classify: None
INFO:tensorflow:Signatures INCLUDED in export for Regress: None
INFO:tensorflow:Signatures INCLUDED in export for Predict: ['predict']
INFO:tensorflow:Signatures INCLUDED in export for Train: None
INFO:tensorflow:Signatures EXCLUDED from export because they cannot be be served via TensorFlow Serving APIs:
INFO:tensorflow:'

INFO:tensorflow:'serving_default' : Classification input must be a single string Tensor; got {'hours_per_week': <tf.Tensor 'Placeholder_121:0' shape=(?,) dtype=float32>, 'workclass': <tf.Tensor 'Placeholder_127:0' shape=(?,) dtype=string>, 'relationship': <tf.Tensor 'Placeholder_126:0' shape=(?,) dtype=string>, 'gender': <tf.Tensor 'Placeholder_122:0' shape=(?,) dtype=string>, 'age': <tf.Tensor 'Placeholder_117:0' shape=(?,) dtype=float32>, 'marital_status': <tf.Tensor 'Placeholder_125:0' shape=(?,) dtype=string>, 'race': <tf.Tensor 'Placeholder_123:0' shape=(?,) dtype=string>, 'capital_gain': <tf.Tensor 'Placeholder_119:0' shape=(?,) dtype=float32>, 'native_country': <tf.Tensor 'Placeholder_129:0' shape=(?,) dtype=string>, 'capital_loss': <tf.Tensor 'Placeholder_120:0' shape=(?,) dtype=float32>, 'education': <tf.Tensor 'Placeholder_124:0' shape=(?,) dtype=string>, 'education_num': <tf.Tensor 'Placeholder_118:0' shape=(?,) dtype=float32>, 'occupation': <tf.Tensor 'Placeholder_128:0' sh

INFO:tensorflow:Restoring parameters from models/census/dnn_classifier/model.ckpt-9000
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
INFO:tensorflow:SavedModel written to: models/census/dnn_classifier/export/estimate/temp-1547575184/saved_model.pb
INFO:tensorflow:global_step/sec: 14.7885
INFO:tensorflow:loss = 60.622482, step = 9001 (6.762 sec)
INFO:tensorflow:global_step/sec: 127.27
INFO:tensorflow:loss = 59.719437, step = 9101 (0.787 sec)
INFO:tensorflow:global_step/sec: 121.027
INFO:tensorflow:loss = 65.30684, step = 9201 (0.826 sec)
INFO:tensorflow:global_step/sec: 111.639
INFO:tensorflow:loss = 77.9431, step = 9301 (0.897 sec)
INFO:tensorflow:global_step/sec: 109.415
INFO:tensorflow:loss = 67.03885, step = 9401 (0.913 sec)
INFO:tensorflow:Saving checkpoints for 9500 into models/census/dnn_classifier/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-01-15-17:59:52
INFO:tensor

<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x124afb890>

In [158]:
%%bash

saved_models_base=${MODEL_DIR}/export/estimate/
echo 'exported model folders:'
ls ${saved_models_base}
echo ''

saved_model_dir=${saved_models_base}$(ls ${saved_models_base} | tail -n 1)
echo 'last exported model: '${saved_model_dir}
ls ${saved_model_dir}
saved_model_cli show --dir=${saved_model_dir} --all

exported model folders:
1547575194

last exported model: models/census/dnn_classifier/export/estimate/1547575194
saved_model.pb
variables

MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['predict']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['age'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_117:0
    inputs['capital_gain'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_119:0
    inputs['capital_loss'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_120:0
    inputs['education'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: Placeholder_124:0
    inputs['education_num'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_118:0
    inputs['gender'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: Placeholder_122:0
    inputs[