[Auto Micro Batch] Enabling the Auto Micro Batch feature in WDL of modelzoo raises an error #127

Open
Duyi-Wang opened this issue Mar 23, 2022 · 0 comments

@Duyi-Wang (Contributor)

I want to enable the Auto Micro Batch feature in WDL. I followed the steps in the DeepRec docs, but I get an error.

Code to reproduce the issue
I use the following code to enable Auto Graph Fusion. For the complete script, see Full code.

        if args.op_fusion and not args.tf:
            # Enable Auto Graph Fusion
            sess_config.graph_options.optimizer_options.do_op_fusion = True

Running python train.py --steps 1000 --no_eval --micro_batch 2 with the WDL dataset reproduces the error.
When --micro_batch (micro_batch_num) is set to 1, training runs fine.
The docs say "The AutoMicroBatch feature depends on the user enabling the graph optimization option". Does that mean Auto Graph Fusion? It can be enabled with --op_fusion True, but I get the same error. I also ran into trouble enabling Auto Graph Fusion; see issue #126.
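For readers who want to see how the flags in the command above could map onto the session config, here is a minimal sketch. It is not the modelzoo script: the parser defaults, the str_to_bool helper, and the functions optimizer_options_from_args and apply_options are hypothetical names made up for illustration. Only do_op_fusion and micro_batch_num come from this report, and a SimpleNamespace stands in for sess_config.graph_options.optimizer_options so the sketch runs without DeepRec installed.

```python
import argparse
from types import SimpleNamespace


def str_to_bool(value):
    """Parse 'True'/'False' style values, as in '--op_fusion True' (assumed convention)."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


def build_parser():
    # Flag names are taken from the reproduction command; defaults are assumptions.
    parser = argparse.ArgumentParser()
    parser.add_argument("--steps", type=int, default=0)
    parser.add_argument("--no_eval", action="store_true")
    parser.add_argument("--micro_batch", type=int, default=0)
    parser.add_argument("--op_fusion", type=str_to_bool, default=False)
    parser.add_argument("--tf", action="store_true")
    return parser


def optimizer_options_from_args(args):
    """Collect the optimizer option fields that the flags would switch on."""
    opts = {}
    if args.tf:
        # Mirrors the 'not args.tf' guard in the snippet above.
        return opts
    if args.micro_batch:
        opts["micro_batch_num"] = args.micro_batch
    if args.op_fusion:
        opts["do_op_fusion"] = True
    return opts


def apply_options(optimizer_options, opts):
    """Set each collected field on the given options object."""
    for name, value in opts.items():
        setattr(optimizer_options, name, value)
    return optimizer_options


# The failing invocation from this report.
args = build_parser().parse_args(
    ["--steps", "1000", "--no_eval", "--micro_batch", "2"])
opts = optimizer_options_from_args(args)

# Stand-in for sess_config.graph_options.optimizer_options.
fake_optimizer_options = apply_options(SimpleNamespace(), opts)
print(opts)
```

In a real run, the object passed to apply_options would be the optimizer options of the ConfigProto handed to the session; whether the actual script wires the flags this way is an assumption.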

This seems to be caused by the initialization of the dataset in MonitoredTrainingSession, so this issue is different from #86, which uses tf.Session().

logs

INFO:tensorflow:Parsing ./data/train.csv
INFO:tensorflow:Parsing ./data/eval.csv
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Init incremental saver , incremental_save:False, incremental_path:./result/model_WIDE_AND_DEEP_1648002155/.incremental_checkpoint/incremental_model.ckpt
INFO:tensorflow:Graph was finalized.
2022-03-23 10:22:39.913346: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3000000000 Hz
2022-03-23 10:22:39.932151: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x556fea568950 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-03-23 10:22:39.932183: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
INFO:tensorflow:run without loading checkpoint
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ./result/model_WIDE_AND_DEEP_1648002155/model.ckpt.
INFO:tensorflow:Create incremental timer, incremental_save:False, incremental_save_secs:None
Using TensorFlow version 1.15.5
Checking dataset...
Numbers of training dataset is 8000000
Numbers of test dataset is 2000000
The training steps is 100
The testing steps is 7813
Saving model checkpoints to ./result/model_WIDE_AND_DEEP_1648002155
Traceback (most recent call last):
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.FailedPreconditionError: GetNext() failed because the iterator has not been initialized. Ensure that you have run the initializer operation for this iterator before getting the next element.
	 [[{{node IteratorGetNext_1/dup0}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_rebuild.py", line 746, in <module>
    main()
  File "train_rebuild.py", line 542, in main
    checkpoint_dir, tf_config, server)
  File "train_rebuild.py", line 414, in train
    sess.run([model.loss, model.train_op])
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 804, in run
    run_metadata=run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1309, in run
    run_metadata=run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1410, in run
    raise six.reraise(*original_exc_info)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1395, in run
    return self._sess.run(*args, **kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1468, in run
    run_metadata=run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/training/monitored_session.py", line 1226, in run
    return self._sess.run(*args, **kwargs)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/duyi/miniconda3/envs/deeprec/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: GetNext() failed because the iterator has not been initialized. Ensure that you have run the initializer operation for this iterator before getting the next element.
	 [[{{node IteratorGetNext_1/dup0}}]]
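
When triaging a log like the one above, it can help to pull out the node names that TensorFlow reports in its [[{{node ...}}]] markers. The sketch below is a small standard-library helper written for this purpose; failing_nodes and NODE_PATTERN are hypothetical names, not part of TensorFlow or DeepRec. Here the reported node is IteratorGetNext_1/dup0. The /dup0 suffix may indicate a copy of the original IteratorGetNext op created by a graph rewrite, but that reading is an assumption and is not confirmed by the log itself.

```python
import re

# Matches the "[[{{node <name>}}]]" marker that TensorFlow appends to op errors.
NODE_PATTERN = re.compile(r"\[\[\{\{node (?P<name>[^}]+)\}\}\]\]")


def failing_nodes(log_text):
    """Return the distinct node names mentioned in the log, in order of appearance."""
    seen = []
    for match in NODE_PATTERN.finditer(log_text):
        name = match.group("name")
        if name not in seen:
            seen.append(name)
    return seen


# Two lines copied from the log in this report.
sample = (
    "tensorflow.python.framework.errors_impl.FailedPreconditionError: "
    "GetNext() failed because the iterator has not been initialized. "
    "Ensure that you have run the initializer operation for this iterator "
    "before getting the next element.\n"
    "\t [[{{node IteratorGetNext_1/dup0}}]]\n"
)

nodes = failing_nodes(sample)
print(nodes)
```

The same error appears twice in the full traceback, so the helper de-duplicates names while keeping their order.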