
GNMT v2 Tensorflow: How to enable automatic mixed precision for evaluation run #282

Closed
mankeyboy opened this issue Nov 7, 2019 · 4 comments


@mankeyboy

I'm trying to run the GNMT TF code on a bare-metal system where I've set up the CUDA stack and tensorflow-gpu 1.15. There were a few TensorFlow API changes between 1.14 and 1.15, but after resolving those I was able to run the code for both training and evaluation.

However, comparing the logs against those from the NGC container, I see that this bare-metal run isn't making use of AMP. I went into Nvidia's docs and found the way to enable it for training here.
I added the following line before here:

opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
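In context, the wrapper goes right after the optimizer is constructed and before any gradients are computed; a minimal sketch (assuming TF 1.14+ on a GPU; the learning rate is illustrative):

```python
import tensorflow as tf  # TF 1.x

# Build the optimizer as usual...
opt = tf.train.AdamOptimizer(learning_rate=1e-3)
# ...then wrap it: the Grappler pass will cast whitelisted ops to
# fp16, and automatic loss scaling is applied to the gradients.
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
# Build train_op from `opt` as before; backprop now runs under AMP.
```

Since eval_fn() never constructs an optimizer, this wrapper is never invoked during evaluation, which is why a separate session-config change is needed there.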

However, this doesn't enable automatic mixed precision for evaluation, since the optimizer is only invoked during backprop. So I tried enabling the mixed-precision graph rewrite for eval_fn() by modifying the session config in estimator.py:

def eval_fn(hparams, ckpt=None, only_translate=False):
  model_fn = make_model_fn(hparams)
  sess_config = tf.ConfigProto(allow_soft_placement=True)
  sess_config.graph_options.rewrite_options.auto_mixed_precision = 1
  config = tf.estimator.RunConfig(
      log_step_count_steps=hparams.log_step_count_steps,
      session_config=sess_config)
  pred_estimator = tf.estimator.Estimator(
      model_fn=model_fn, model_dir=hparams.output_dir, config=config)
  return get_metrics(hparams, model_fn, pred_estimator, ckpt, only_translate=only_translate)

and commenting out this call.

However, running this gives the following error:

Colocation members, user-requested devices, and framework assigned devices, if any:
  tower_0/v0/index_to_string/hash_table (HashTableV2) /device:GPU:0
  tower_0/v0/index_to_string/table_init/InitializeTableFromTextFileV2 (InitializeTableFromTextFileV2) /device:GPU:0
  tower_0/v0/hash_table_Lookup/LookupTableFindV2 (LookupTableFindV2) /device:GPU:0

2019-11-07 07:51:24.124179: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1767] Running auto_mixed_precision graph optimizer
2019-11-07 07:51:24.124776: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-11-07 07:51:24.803817: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1767] Running auto_mixed_precision graph optimizer
2019-11-07 07:51:24.804442: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
I1107 07:51:24.825255 140735364352992 session_manager.py:500] Running local_init_op.
2019-11-07 07:51:24.846707: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1767] Running auto_mixed_precision graph optimizer
2019-11-07 07:51:24.846978: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-11-07 07:51:24.870466: I tensorflow/core/kernels/lookup_util.cc:376] Table trying to initialize from file results/vocab.bpe.32000.en is already initialized.
I1107 07:51:24.872127 140735364352992 session_manager.py:502] Done running local_init_op.
2019-11-07 07:51:24.902816: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1767] Running auto_mixed_precision graph optimizer
2019-11-07 07:51:24.903393: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-11-07 07:51:24.950724: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1767] Running auto_mixed_precision graph optimizer
2019-11-07 07:51:24.951080: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-11-07 07:51:24.958353: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-11-07 07:51:24.960220: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-11-07 07:51:24.961727: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-11-07 07:51:24.963636: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-11-07 07:51:24.965878: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-11-07 07:51:24.967928: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1241] No whitelist ops found, nothing to do
2019-11-07 07:51:25.309130: I tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1767] Running auto_mixed_precision graph optimizer
2019-11-07 07:51:25.319260: W tensorflow/core/grappler/optimizers/auto_mixed_precision.cc:1775] auto_mixed_precision graph optimizer FAILED: Failed precondition: Expected exactly 1 output from port tower_0/v0/dynamic_seq2seq/decoder/decoder/while/NextIteration_22:0, got 2
2019-11-07 07:51:25.319653: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:502] auto_mixed_precision failed: Failed precondition: Expected exactly 1 output from port tower_0/v0/dynamic_seq2seq/decoder/decoder/while/NextIteration_22:0, got 2
2019-11-07 07:51:25.497377: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
I1107 07:53:57.598690 140735364352992 estimator.py:748] Writing to file results/newstest2014_out_4000.tok.de
W1107 07:53:57.614538 140735364352992 deprecation_wrapper.py:119] From /home/mayroy13/Mayank/Mayank/test/nvidia_tf_examples/gnmt_v2/estimator.py:758: The name tf.summary.FileWriter is deprecated. Please use tf.compat.v1.summary.FileWriter instead.

W1107 07:53:57.615267 140735364352992 deprecation_wrapper.py:119] From /home/mayroy13/Mayank/Mayank/test/nvidia_tf_examples/gnmt_v2/estimator.py:685: The name tf.gfile.Remove is deprecated. Please use tf.io.gfile.remove instead.

W1107 07:53:57.615499 140735364352992 deprecation_wrapper.py:119] From /home/mayroy13/Mayank/Mayank/test/nvidia_tf_examples/gnmt_v2/estimator.py:686: The name tf.gfile.Copy is deprecated. Please use tf.io.gfile.copy instead.

Warning: No built-in rules for language de.
Detokenizer Version $Revision: 4134 $
Language: de

Any leads on enabling automatic mixed precision for evaluation would be helpful. Thanks :)

@maciej-sypetkowski

Using AMP with official TensorFlow is a little different than with the NGC containers, but the changes you've made should be enough to get AMP working with official TensorFlow.

I've tried to reproduce your problem. I took the tensorflow/tensorflow:1.15.0-gpu-py3 container and made the changes you described (see the patch below). It works without any problems in both training and evaluation, and it uses AMP.

If this patch doesn't work for you, the problem is probably with your setup.

diff --git a/TensorFlow/Translation/GNMT/block_lstm.py b/TensorFlow/Translation/GNMT/block_lstm.py
index 3b0c784..559d620 100644
--- a/TensorFlow/Translation/GNMT/block_lstm.py
+++ b/TensorFlow/Translation/GNMT/block_lstm.py
@@ -20,7 +20,7 @@ from __future__ import print_function
 import abc
 import tensorflow as tf
 
-from tensorflow.contrib.rnn.ops import gen_lstm_ops
+from tensorflow.python.ops import gen_rnn_ops as gen_lstm_ops
 from tensorflow.python.framework import function
 from tensorflow.python.layers import base as base_layer
 
diff --git a/TensorFlow/Translation/GNMT/estimator.py b/TensorFlow/Translation/GNMT/estimator.py
index a0e7fc5..72e725a 100644
--- a/TensorFlow/Translation/GNMT/estimator.py
+++ b/TensorFlow/Translation/GNMT/estimator.py
@@ -214,6 +214,9 @@ class ModelFnFactory(object):
       opt = tf.train.AdamOptimizer(learning_rate)
     else:
       raise ValueError("Unknown optimizer type %s" % hparams.optimizer)
+
+    if hparams.use_amp:
+      opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
     return opt
 
   def _compute_tower_grads(self, tower_loss, tower_params, learning_rate, use_fp16=False,
@@ -712,10 +715,11 @@ def get_sacrebleu(trans_file, detokenizer_file):
   return float(score)
 
 
-def get_metrics(hparams, model_fn, ckpt=None, only_translate=False):
+def get_metrics(hparams, model_fn, pred_estimator=None, ckpt=None, only_translate=False):
   """Run inference and compute metrics."""
-  pred_estimator = tf.estimator.Estimator(
-      model_fn=model_fn, model_dir=hparams.output_dir)
+  if pred_estimator is None:
+    pred_estimator = tf.estimator.Estimator(
+        model_fn=model_fn, model_dir=hparams.output_dir)
 
   benchmark_hook = BenchmarkHook(hparams.infer_batch_size)
 
@@ -836,4 +840,12 @@ def train_fn(hparams):
 
 def eval_fn(hparams, ckpt=None, only_translate=False):
   model_fn = make_model_fn(hparams)
-  return get_metrics(hparams, model_fn, ckpt, only_translate=only_translate)
+  sess_config = tf.ConfigProto(allow_soft_placement=True)
+  if hparams.use_amp:
+    sess_config.graph_options.rewrite_options.auto_mixed_precision = 1
+  config = tf.estimator.RunConfig(
+        log_step_count_steps=hparams.log_step_count_steps,
+        session_config=sess_config)
+  pred_estimator = tf.estimator.Estimator(
+      model_fn=model_fn, model_dir=hparams.output_dir, config=config)
+  return get_metrics(hparams, model_fn, pred_estimator, ckpt, only_translate=only_translate)
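Stripped of the GNMT-specific plumbing, the evaluation-side change in the patch boils down to a session-config fragment that any TF 1.15 Estimator can use (the model_fn and model_dir names here are illustrative):

```python
import tensorflow as tf  # TF 1.15

sess_config = tf.ConfigProto(allow_soft_placement=True)
# Turn on the auto_mixed_precision Grappler pass for every graph
# executed by sessions built from this config, inference included.
sess_config.graph_options.rewrite_options.auto_mixed_precision = 1

run_config = tf.estimator.RunConfig(session_config=sess_config)
# estimator = tf.estimator.Estimator(model_fn=my_model_fn,
#                                    model_dir="results",
#                                    config=run_config)
```

When the rewrite actually fires, the logs show "Running auto_mixed_precision graph optimizer" followed by a line reporting how many nodes were converted to float16, rather than the "No whitelist ops found" messages in the original report.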

@mankeyboy
Author

Thank you for the patch. It does exactly what I intended. I retested my code with TF 1.15 and it works, while TF 1.14 throws the error I posted in my original issue. Is the error in 1.14 a design change or a bug?

@maciej-sypetkowski

maciej-sypetkowski commented Nov 14, 2019

It was a bug in 1.14, and it has been fixed in 1.15.

@mankeyboy
Author

Thanks, I'll close this issue then.
