
GNMT v2 Tensorflow: How to enable automatic mixed precision for evaluation run #282

Closed
mankeyboy opened this issue Nov 7, 2019 · 4 comments


@mankeyboy

I'm trying to run the GNMT TF code on a bare-metal system where I've set up the CUDA stack and tensorflow-gpu 1.15. There were a few TensorFlow API changes between 1.14 and 1.15, but after resolving those I was able to run the code for both training and evaluation.

However, comparing the logs against those from the NGC container, I see that this bare-metal run isn't making use of AMP. I went into Nvidia's docs and found the way to enable it for training here.
I added the following line before here:

opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
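In context, the wrapper goes right after the optimizer is constructed and before any gradients are computed; a minimal sketch (assuming TF 1.14+ on a GPU; the learning rate is illustrative):

```python
import tensorflow as tf  # TF 1.x

# Build the optimizer as usual...
opt = tf.train.AdamOptimizer(learning_rate=1e-3)
# ...then wrap it: the Grappler pass will cast whitelisted ops to
# fp16, and automatic loss scaling is applied to the gradients.
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
# Build train_op from `opt` as before; backprop now runs under AMP.
```

Since eval_fn() never constructs an optimizer, this wrapper is never invoked during evaluation, which is why a separate session-config change is needed there.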

However, this doesn't enable automatic mixed precision for evaluation, since the optimizer is only invoked during backprop. So I tried enabling the mixed-precision graph rewrite for eval_fn() by modifying the session config in estimator.py:

def eval_fn(hparams, ckpt=None, only_translate=False):
  model_fn = make_model_fn(hparams)
  sess_config = tf.ConfigProto(allow_soft_placement=True)
  sess_config.graph_options.rewrite_options.auto_mixed_precision = 1
  config = tf.estimator.RunConfig(
      log_step_count_steps=hparams.log_step_count_steps,
      session_config=sess_config)
  pred_estimator = tf.estimator.Estimator(
      model_fn=model_fn, model_dir=hparams.output_dir, config=config)
  return get_metrics(hparams, model_fn, pred_estimator, ckpt, only_translate=only_translate)

and commenting out this call.

However, running this gives the following error:

Colocation members, user-requested devices, and framework assigned devices, if any:
  tower_0/v0/index_to_string/hash_table (HashTableV2) /device:GPU:0
  tower_0/v0/index_to_string/table_init/InitializeTableFromTextFileV2 (InitializeTableFromTextFileV2) /device:GPU:0
  tower_0/v0/hash_table_Lookup/LookupTableFindV2 (LookupTableFindV2) /device:GPU:0

2019-11-07 07:51:24.124179: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1767] Running auto_mixed_precision graph optimizer
2019-11-07 07:51:24.124776: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-11-07 07:51:24.803817: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1767] Running auto_mixed_precision graph optimizer
2019-11-07 07:51:24.804442: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
I1107 07:51:24.825255 140735364352992 session_manager.py:500] Running local_init_op.
2019-11-07 07:51:24.846707: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1767] Running auto_mixed_precision graph optimizer
2019-11-07 07:51:24.846978: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-11-07 07:51:24.870466: I tensorflow/core/kernels/lookup_util.cc:376] Table trying to initialize from file results/vocab.bpe.32000.en is already initialized.
I1107 07:51:24.872127 140735364352992 session_manager.py:502] Done running local_init_op.
2019-11-07 07:51:24.902816: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1767] Running auto_mixed_precision graph optimizer
2019-11-07 07:51:24.903393: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-11-07 07:51:24.950724: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1767] Running auto_mixed_precision graph optimizer
2019-11-07 07:51:24.951080: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-11-07 07:51:24.958353: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-11-07 07:51:24.960220: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-11-07 07:51:24.961727: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-11-07 07:51:24.963636: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-11-07 07:51:24.965878: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-11-07 07:51:24.967928: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-11-07 07:51:25.309130: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1767] Running auto_mixed_precision graph optimizer
2019-11-07 07:51:25.319260: W tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1775] auto_mixed_precision graph optimizer FAILED: Failed precondition: Expected exactly 1 output from port tower_0/v0/dynamic_seq2seq/decoder/decoder/while/NextIteration_22:0, got 2
2019-11-07 07:51:25.319653: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] auto_mixed_precision failed: Failed precondition: Expected exactly 1 output from port tower_0/v0/dynamic_seq2seq/decoder/decoder/while/NextIteration_22:0, got 2
2019-11-07 07:51:25.497377: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
I1107 07:53:57.598690 140735364352992 estimator.py:748] Writing to file results/newstest2014_out_4000.tok.de
W1107 07:53:57.614538 140735364352992 deprecation_wrapper.py:119] From /home/mayroy13/Mayank/Mayank/test/nvidia_tf_examples/gnmt_v2/estimator.py:758: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

W1107 07:53:57.615267 140735364352992 deprecation_wrapper.py:119] From /home/mayroy13/Mayank/Mayank/test/nvidia_tf_examples/gnmt_v2/estimator.py:685: The name tf.gfile.Remove is deprecated. Please use tf.io.gfile.remove instead.

W1107 07:53:57.615499 140735364352992 deprecation_wrapper.py:119] From /home/mayroy13/Mayank/Mayank/test/nvidia_tf_examples/gnmt_v2/estimator.py:686: The name tf.gfile.Copy is deprecated. Please use tf.io.gfile.copy instead.

Warning: No built-in rules for language de.
Detokenizer Version $Revision: 4134 $
Language: de

Any leads on enabling automatic mixed precision for evaluation would be helpful. Thanks :)

@maciej-sypetkowski

Using AMP with official TensorFlow is a little different than with the NGC containers, but the changes you've made should be enough to get AMP working with official TensorFlow.

I've tried to reproduce your problem. I took the tensorflow/tensorflow:1.15.0-gpu-py3 container and made the changes you described (see the patch below). It works without any problems in both training and evaluation, and it uses AMP.

If this patch doesn't work for you, the problem is probably with your setup.

diff --git a/TensorFlow/Translation/GNMT/block_lstm.py b/TensorFlow/Translation/GNMT/block_lstm.py
index 3b0c784..559d620 100644
--- a/TensorFlow/Translation/GNMT/block_lstm.py
+++ b/TensorFlow/Translation/GNMT/block_lstm.py
@@ -20,7 +20,7 @@ from __future__ import print_function
 import abc
 import tensorflow as tf
 
-from tensorflow.contrib.rnn.ops import gen_lstm_ops
+from tensorflow.python.ops import gen_rnn_ops as gen_lstm_ops
 from tensorflow.python.framework import function
 from tensorflow.python.layers import base as base_layer
 
diff --git a/TensorFlow/Translation/GNMT/estimator.py b/TensorFlow/Translation/GNMT/estimator.py
index a0e7fc5..72e725a 100644
--- a/TensorFlow/Translation/GNMT/estimator.py
+++ b/TensorFlow/Translation/GNMT/estimator.py
@@ -214,6 +214,9 @@ class ModelFnFactory(object):
       opt = tf.train.AdamOptimizer(learning_rate)
     else:
       raise ValueError("Unknown optimizer type %s" % hparams.optimizer)
+
+    if hparams.use_amp:
+      opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
     return opt
 
   def _compute_tower_grads(self, tower_loss, tower_params, learning_rate, use_fp16=False,
@@ -712,10 +715,11 @@ def get_sacrebleu(trans_file, detokenizer_file):
   return float(score)
 
 
-def get_metrics(hparams, model_fn, ckpt=None, only_translate=False):
+def get_metrics(hparams, model_fn, pred_estimator=None, ckpt=None, only_translate=False):
   """Run inference and compute metrics."""
-  pred_estimator = tf.estimator.Estimator(
-      model_fn=model_fn, model_dir=hparams.output_dir)
+  if pred_estimator is None:
+    pred_estimator = tf.estimator.Estimator(
+        model_fn=model_fn, model_dir=hparams.output_dir)
 
   benchmark_hook = BenchmarkHook(hparams.infer_batch_size)
 
@@ -836,4 +840,12 @@ def train_fn(hparams):
 
 def eval_fn(hparams, ckpt=None, only_translate=False):
   model_fn = make_model_fn(hparams)
-  return get_metrics(hparams, model_fn, ckpt, only_translate=only_translate)
+  sess_config = tf.ConfigProto(allow_soft_placement=True)
+  if hparams.use_amp:
+    sess_config.graph_options.rewrite_options.auto_mixed_precision = 1
+  config = tf.estimator.RunConfig(
+        log_step_count_steps=hparams.log_step_count_steps,
+        session_config=sess_config)
+  pred_estimator = tf.estimator.Estimator(
+      model_fn=model_fn, model_dir=hparams.output_dir, config=config)
+  return get_metrics(hparams, model_fn, pred_estimator, ckpt, only_translate=only_translate)
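Stripped of the GNMT-specific plumbing, the evaluation-side change in the patch boils down to a session-config fragment that any TF 1.15 Estimator can use (the model_fn and model_dir names here are illustrative):

```python
import tensorflow as tf  # TF 1.15

sess_config = tf.ConfigProto(allow_soft_placement=True)
# Turn on the auto_mixed_precision Grappler pass for every graph
# executed by sessions built from this config, inference included.
sess_config.graph_options.rewrite_options.auto_mixed_precision = 1

run_config = tf.estimator.RunConfig(session_config=sess_config)
# estimator = tf.estimator.Estimator(model_fn=my_model_fn,
#                                    model_dir="results",
#                                    config=run_config)
```

When the rewrite actually fires, the logs show "Running auto_mixed_precision graph optimizer" followed by a line reporting how many nodes were converted to float16, rather than the "No whitelist ops found" messages in the original report.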

@mankeyboy
Author

Thank you for the patch. It does exactly what I intended. I retested my code with TF 1.15 and it works, while TF 1.14 throws the error I posted in my original issue. Is the error in 1.14 a design change or a bug?

@maciej-sypetkowski

maciej-sypetkowski commented Nov 14, 2019

It was a bug in 1.14, and it has been fixed in 1.15.

@mankeyboy
Author

Thanks, I'll close this issue then.
