
Out of Memory on Small Dataset #151

Closed · stevesmit opened this issue Oct 4, 2018 · 12 comments

@stevesmit

Describe the bug
When attempting to train a classifier on a small dataset of 8,000 documents, I get an out of memory error and the script stops running.

Minimal Reproducible Example
Version of finetune = 0.4.1
Version of tensorflow-gpu = 1.8.0
Version of CUDA = release 9.0, V9.0.176
Windows 10 Pro

Load a dataset of documents (X_train) and labels (Y_train), where each document and each label is simply a string, then:

model = finetune.Classifier(max_length=256, batch_size=1)  # reduced to try to shrink the memory footprint
model.fit(X_train, Y_train)
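
For completeness, an equivalent self-contained snippet with placeholder data standing in for the real corpus (the actual documents aren't reproduced here):

import finetune

# Placeholder documents and labels: each is a plain string, as in the real dataset.
X_train = ["some example document text " * 10] * 8000
Y_train = ["label_a", "label_b"] * 4000

model = finetune.Classifier(max_length=256, batch_size=1)
model.fit(X_train, Y_train)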

Expected behavior
I expected the model to train, but it crashes before training even starts.

Additional context
I get the following warnings in the jupyter notebook:

C:\Users...\Python35\site-packages\finetune\encoding.py:294: UserWarning: Some examples are longer than the max_length. Please trim documents or increase max_length. Fallback behaviour is to use the first 254 byte-pair encoded tokens
"Fallback behaviour is to use the first {} byte-pair encoded tokens".format(max_length - 2)
C:\Users...\Python35\site-packages\finetune\encoding.py:233: UserWarning: Document is longer than max length allowed, trimming document to 256 tokens.
max_length
C:\Users...\tensorflow\python\ops\gradients_impl.py: 100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
WARNING:tensorflow:From C:\Users...\tensorflow\python\util\tf_should_use.py:118: initialize_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use tf.variables_initializer instead.

And then I get the following diagnostic info showing up in the command prompt:

2018-10-04 17:26:36.920118: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-10-04 17:26:37.716883: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 0 with properties:
name: Quadro M1200 major: 5 minor: 0 memoryClockRate(GHz): 1.148
pciBusID: 0000:01:00.0
totalMemory: 4.00GiB freeMemory: 3.35GiB
2018-10-04 17:26:37.725637: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1435] Adding visible gpu devices: 0
2018-10-04 17:26:38.412484: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-04 17:26:38.417413: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:929] 0
2018-10-04 17:26:38.419392: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:942] 0: N
2018-10-04 17:26:38.421353: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1053] Created TensorFlow device (/device:GPU:0 with 3083 MB memory) -> physical GPU (device: 0, name: Quadro M1200, pci bus id: 0000:01:00.0, compute capability: 5.0)
[I 17:28:26.081 NotebookApp] Saving file at /projects/language-models/Finetune Package.ipynb
2018-10-04 17:29:14.118663: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1435] Adding visible gpu devices: 0
2018-10-04 17:29:14.123595: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-04 17:29:14.127649: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:929] 0
2018-10-04 17:29:14.135411: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:942] 0: N
2018-10-04 17:29:14.138698: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3083 MB memory) -> physical GPU (device: 0, name: Quadro M1200, pci bus id: 0000:01:00.0, compute capability: 5.0)
2018-10-04 17:30:06.881174: W T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 9.00MiB. Current allocation summary follows.
2018-10-04 17:30:06.900550: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (256): Total Chunks: 60, Chunks in use: 60. 15.0KiB allocated for chunks. 15.0KiB in use in bin. 312B client-requested in use in bin.
2018-10-04 17:30:06.929551: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (512): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-10-04 17:30:06.964647: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (1024): Total Chunks: 2, Chunks in use: 2. 2.5KiB allocated for chunks. 2.5KiB in use in bin. 2.0KiB client-requested in use in bin.
2018-10-04 17:30:06.995394: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (2048): Total Chunks: 532, Chunks in use: 532. 1.56MiB allocated for chunks. 1.56MiB in use in bin. 1.56MiB client-requested in use in bin.
2018-10-04 17:30:07.031613: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (4096): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-10-04 17:30:07.061013: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (8192): Total Chunks: 137, Chunks in use: 137. 1.39MiB allocated for chunks. 1.39MiB in use in bin. 1.39MiB client-requested in use in bin.
2018-10-04 17:30:07.093603: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (16384): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-10-04 17:30:07.130530: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (32768): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-10-04 17:30:07.170321: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (65536): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-10-04 17:30:07.212730: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (131072): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-10-04 17:30:07.246329: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (262144): Total Chunks: 2, Chunks in use: 2. 512.0KiB allocated for chunks. 512.0KiB in use in bin. 512.0KiB client-requested in use in bin.
2018-10-04 17:30:07.288640: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (524288): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-10-04 17:30:07.303248: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (1048576): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-10-04 17:30:07.332990: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (2097152): Total Chunks: 71, Chunks in use: 71. 159.75MiB allocated for chunks. 159.75MiB in use in bin. 159.75MiB client-requested in use in bin.
2018-10-04 17:30:07.364897: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (4194304): Total Chunks: 69, Chunks in use: 68. 466.99MiB allocated for chunks. 459.00MiB in use in bin. 459.00MiB client-requested in use in bin.
2018-10-04 17:30:07.396862: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (8388608): Total Chunks: 140, Chunks in use: 140. 1.23GiB allocated for chunks. 1.23GiB in use in bin. 1.23GiB client-requested in use in bin.
2018-10-04 17:30:07.428029: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (16777216): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-10-04 17:30:07.464813: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (33554432): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-10-04 17:30:07.494067: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (67108864): Total Chunks: 10, Chunks in use: 10. 1.17GiB allocated for chunks. 1.17GiB in use in bin. 1.17GiB client-requested in use in bin.
2018-10-04 17:30:07.524156: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (134217728): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-10-04 17:30:07.550345: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:630] Bin (268435456): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2018-10-04 17:30:07.578392: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:646] Bin for 9.00MiB was 8.00MiB, Chunk State:
2018-10-04 17:30:07.600123: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:665] Chunk at 0000000801980000 of size 1280
2018-10-04 17:30:07.629493: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:665] Chunk at 0000000801980500 of size 1280
2018-10-04 17:30:07.649189: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:665] Chunk at 0000000801980A00 of size 125144064
2018-10-04 17:30:07.676965: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:665] Chunk at 00000008090D9600 of size 7077888
2018-10-04 17:30:07.699245: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:665] Chunk at 0000000809799600 of size 3072
2018-10-04 17:30:07.718738: I T:\src\github\tensorflow\tensorflow\core\common_runtime\bfc_allocator.cc:665] Chunk at 000000080979A200 of size 3072

...and so on. In my opinion this is a pretty small dataset, and I've set max_length fairly low, so I don't think this is a hardware limitation but rather a bug.

@madisonmay (Contributor)

Hi @stevesmit,

Could you try again with low_memory_mode=True?

i.e.:

model = finetune.Classifier(max_length=256, batch_size=1, low_memory_mode=True)

@madisonmay (Contributor)

I agree this isn't expected behavior, though. With TensorFlow's gpu_options.allow_growth option set to True and computation isolated to a single GTX 980, I see the following in nvidia-smi with the configuration you're running:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980     Off  | 00000000:01:00.0 Off |                  N/A |
| 29%   50C    P2    72W / 180W |   2448MiB /  4043MiB |     74%      Default |
+-------------------------------+----------------------+----------------------+

So that's about 2.5GB of GPU memory required with settings identical to yours. I'll have to dig into this issue further and may need some more info from your side to reproduce it here in a bit.
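
(For reference, allow_growth is set on the plain TensorFlow 1.x session config -- a minimal sketch, independent of finetune:)

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving nearly all of it up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)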

@madisonmay (Contributor)

As an aside, this issue should be independent of dataset size. Dataset size increases CPU memory usage but not GPU memory usage, since each batch is sent to the GPU independently; only changes to batch_size and max_length should cause GPU memory usage to vary, as in the rough illustration below.
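
Rough illustration only (reusing the X_train / Y_train names from the report above):

# Peak GPU memory scales with these two settings...
model = finetune.Classifier(max_length=128, batch_size=1)
# ...whereas fitting on a subset of the data should leave peak GPU memory unchanged:
model.fit(X_train[:100], Y_train[:100])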

@stevesmit (Author)

> Hi @stevesmit,
>
> Could you try again with low_memory_mode=True?
>
> i.e.:
>
> model = finetune.Classifier(max_length=256, batch_size=1, low_memory_mode=True)

I tried that but I got the same error 👎

madisonmay self-assigned this Oct 5, 2018
@madisonmay (Contributor)

I've managed to replicate your problem by reverting to 0.4.1 and installing tensorflow-gpu==1.8.0.

I believe installing the development branch of this GitHub repo will resolve things for you. I'm not 100% certain of the root cause, but my guess is that it's related to whether or not a portion of the language model graph is built when it is not required.
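
Roughly, assuming the standard git + setuptools workflow (the README instructions are authoritative):

git clone https://github.com/IndicoDataSolutions/finetune.git
cd finetune
git checkout development
python setup.py develop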

If you experience further issues after installing the development version of the repo, the next option is to use the lighter-weight version of the model:

from finetune.config import get_small_model_config

model = finetune.Classifier(config=get_small_model_config(), batch_size=1, max_length=256)

Sorry for the difficulty! This warrants a new PyPI release cut from the dev branch -- will hopefully have a 0.4.2 release out shortly.

--Madison

@stevesmit (Author)

Hi Madison,

I cloned the development branch and ran the setup instructions in the README, but sadly I got the same error as before. Furthermore, from finetune.config import get_small_model_config didn't work:

cannot import name 'get_small_model_config'

@madisonmay (Contributor)

Hrmmm... seems to me like the setup didn't complete for one reason or another. That function definitely exists in the development branch. Perhaps the python setup.py develop or python setup.py install command failed with a permissions error?

get_small_model_config is here: https://github.com/IndicoDataSolutions/finetune/blob/development/finetune/config.py#L204
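
A quick way to check which copy of the package is actually being imported (after restarting the kernel):

import finetune
print(finetune.__file__)  # should point into the cloned development checkout, not the old site-packages install

from finetune.config import get_small_model_config  # only present on the development branch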

@madisonmay (Contributor) commented Oct 19, 2018

I can't be sure, but we've just pushed an update to finetune (0.5.8) that may resolve your issue. If you've installed via pip you can upgrade via:

sudo pip install finetune --upgrade

@madisonmay (Contributor)

Hi @stevesmit, just checking in again. Have you had a chance to try this out on your machine? Curious to know if the refactor to the tf.Estimator API helped to resolve your issue.

@stevesmit (Author)

@madisonmay Sadly I'm still having trouble. Upgraded to 0.5.11.

After some ResourceExhaustedErrors I toned down the parameters and used a toy example dataset. It manages to start finetuning but ends up with a weird error. Here's my code and the error:

import random
import finetune

model = finetune.Classifier(max_length=32, batch_size=1)
X_train_dummy = ["Hello there" if random.randint(1, 10) < 5 else "Bonjour" for i in range(200)]
Y_train_dummy = ["en" if random.randint(1, 10) < 5 else "fr" for i in range(200)]
model.fit(X_train_dummy, Y_train_dummy)

Here's the full traceback:

InternalError Traceback (most recent call last)
~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
1333 try:
-> 1334 return fn(*args)
1335 except errors.OpError as e:

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\client\session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
1318 return self._call_tf_sessionrun(
-> 1319 options, feed_dict, fetch_list, target_list, run_metadata)
1320

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\client\session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
1406 self._session, options, feed_dict, fetch_list, target_list,
-> 1407 run_metadata)
1408

InternalError: Dst tensor is not initialized.
[[{{node _arg_Placeholder_134_0_38/_161}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_527__arg_Placeholder_134_0_38", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

During handling of the above exception, another exception occurred:

InternalError Traceback (most recent call last)
in
----> 1 model.fit(X_train_dummy, Y_train_dummy) # Finetune base model on custom data

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\finetune\base.py in fit(self, *args, **kwargs)
233 def fit(self, *args, **kwargs):
234 """ An alias for finetune. """
--> 235 return self.finetune(*args, **kwargs)
236
237 def _predict(self, Xs):

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\finetune\classifier.py in finetune(self, X, Y, batch_size)
66 corresponds to the number of training examples provided to each GPU.
67 """
---> 68 return super().finetune(X, Y=Y, batch_size=batch_size)
69
70 def get_eval_fn(cls):

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\finetune\base.py in finetune(self, Xs, Y, batch_size)
169 with warnings.catch_warnings():
170 warnings.simplefilter("ignore")
--> 171 estimator.train(train_input_fn, hooks=train_hooks, steps=num_steps)
172
173 def get_estimator(self, force_build_lm=False):

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\estimator\estimator.py in train(self, input_fn, hooks, steps, max_steps, saving_listeners)
352
353 saving_listeners = _check_listeners_type(saving_listeners)
--> 354 loss = self._train_model(input_fn, hooks, saving_listeners)
355 logging.info('Loss for final step: %s.', loss)
356 return self

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\estimator\estimator.py in _train_model(self, input_fn, hooks, saving_listeners)
1205 return self._train_model_distributed(input_fn, hooks, saving_listeners)
1206 else:
-> 1207 return self._train_model_default(input_fn, hooks, saving_listeners)
1208
1209 def _train_model_default(self, input_fn, hooks, saving_listeners):

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\estimator\estimator.py in _train_model_default(self, input_fn, hooks, saving_listeners)
1239 return self._train_with_estimator_spec(estimator_spec, worker_hooks,
1240 hooks, global_step_tensor,
-> 1241 saving_listeners)
1242
1243 def _train_model_distributed(self, input_fn, hooks, saving_listeners):

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\estimator\estimator.py in _train_with_estimator_spec(self, estimator_spec, worker_hooks, hooks, global_step_tensor, saving_listeners)
1469 loss = None
1470 while not mon_sess.should_stop():
-> 1471 _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
1472 return loss
1473

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\training\monitored_session.py in run(self, fetches, feed_dict, options, run_metadata)
669 feed_dict=feed_dict,
670 options=options,
--> 671 run_metadata=run_metadata)
672
673 def run_step_fn(self, step_fn):

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\training\monitored_session.py in run(self, fetches, feed_dict, options, run_metadata)
1154 feed_dict=feed_dict,
1155 options=options,
-> 1156 run_metadata=run_metadata)
1157 except _PREEMPTION_ERRORS as e:
1158 logging.info('An error was raised. This may be due to a preemption in '

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\training\monitored_session.py in run(self, *args, **kwargs)
1253 raise six.reraise(*original_exc_info)
1254 else:
-> 1255 raise six.reraise(*original_exc_info)
1256
1257

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\six.py in reraise(tp, value, tb)
691 if value.__traceback__ is not tb:
692 raise value.with_traceback(tb)
--> 693 raise value
694 finally:
695 value = None

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\training\monitored_session.py in run(self, *args, **kwargs)
1238 def run(self, *args, **kwargs):
1239 try:
-> 1240 return self._sess.run(*args, **kwargs)
1241 except _PREEMPTION_ERRORS:
1242 raise

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\training\monitored_session.py in run(self, fetches, feed_dict, options, run_metadata)
1318 results=outputs[hook] if hook in outputs else None,
1319 options=options,
-> 1320 run_metadata=run_metadata))
1321 self._should_stop = self._should_stop or run_context.stop_requested
1322

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\contrib\estimator\python\estimator\hooks.py in after_run(self, run_context, run_values)
208 self._iter_count += 1
209 if self._timer.should_trigger_for_step(self._iter_count):
--> 210 self._evaluate(run_context.session)
211
212 def end(self, session): # pylint: disable=unused-argument

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\contrib\estimator\python\estimator\hooks.py in _evaluate(self, train_session)
200 eval_dict=self._eval_dict,
201 all_hooks=self._all_hooks,
--> 202 output_dir=self._eval_dir)
203
204 self._timer.update_last_triggered_step(self._iter_count)

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\estimator\estimator.py in _evaluate_run(self, checkpoint_path, scaffold, update_op, eval_dict, all_hooks, output_dir)
1589 final_ops=eval_dict,
1590 hooks=all_hooks,
-> 1591 config=self._session_config)
1592
1593 current_global_step = eval_results[ops.GraphKeys.GLOBAL_STEP]

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\training\evaluation.py in _evaluate_once(checkpoint_path, master, scaffold, eval_ops, feed_dict, final_ops, final_ops_feed_dict, hooks, config)
269
270 with monitored_session.MonitoredSession(
--> 271 session_creator=session_creator, hooks=hooks) as session:
272 if eval_ops is not None:
273 while not session.should_stop():

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\training\monitored_session.py in __init__(self, session_creator, hooks, stop_grace_period_secs)
919 super(MonitoredSession, self).__init__(
920 session_creator, hooks, should_recover=True,
--> 921 stop_grace_period_secs=stop_grace_period_secs)
922
923

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\training\monitored_session.py in __init__(self, session_creator, hooks, should_recover, stop_grace_period_secs)
641 stop_grace_period_secs=stop_grace_period_secs)
642 if should_recover:
--> 643 self._sess = _RecoverableSession(self._coordinated_creator)
644 else:
645 self._sess = self._coordinated_creator.create_session()

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\training\monitored_session.py in __init__(self, sess_creator)
1105 """
1106 self._sess_creator = sess_creator
-> 1107 _WrappedSession.__init__(self, self._create_session())
1108
1109 def _create_session(self):

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\training\monitored_session.py in _create_session(self)
1110 while True:
1111 try:
-> 1112 return self._sess_creator.create_session()
1113 except _PREEMPTION_ERRORS as e:
1114 logging.info('An error was raised while a session was being created. '

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\training\monitored_session.py in create_session(self)
798 """Creates a coordinated session."""
799 # Keep the tf_sess for unit testing.
--> 800 self.tf_sess = self._session_creator.create_session()
801 # We don't want coordinator to suppress any exception.
802 self.coord = coordinator.Coordinator(clean_stop_exception_types=[])

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\training\monitored_session.py in create_session(self)
564 init_op=self._scaffold.init_op,
565 init_feed_dict=self._scaffold.init_feed_dict,
--> 566 init_fn=self._scaffold.init_fn)
567
568

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\training\session_manager.py in prepare_session(self, master, init_op, saver, checkpoint_dir, checkpoint_filename_with_path, wait_for_checkpoint, max_wait_secs, config, init_feed_dict, init_fn)
294 sess.run(init_op, feed_dict=init_feed_dict)
295 if init_fn:
--> 296 init_fn(sess)
297
298 local_init_success, msg = self._try_run_local_init_op(sess)

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\training\monitored_session.py in <lambda>(sess)
162 self._user_init_fn = init_fn
163 if init_fn:
--> 164 self._init_fn = lambda sess: init_fn(self, sess)
165 else:
166 self._init_fn = None

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\contrib\estimator\python\estimator\hooks.py in feed_variables(***failed resolving arguments***)
188 def feed_variables(scaffold, session):
189 del scaffold
--> 190 session.run(self._var_feed_op, feed_dict=placeholder_to_value)
191
192 scaffold = training.Scaffold(

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\client\session.py in run(self, fetches, feed_dict, options, run_metadata)
927 try:
928 result = self._run(None, fetches, feed_dict, options_ptr,
--> 929 run_metadata_ptr)
930 if run_metadata:
931 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\client\session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
1150 if final_fetches or final_targets or (handle and feed_dict_tensor):
1151 results = self._do_run(handle, final_targets, final_fetches,
-> 1152 feed_dict_tensor, options, run_metadata)
1153 else:
1154 results = []

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\client\session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
1326 if handle is None:
1327 return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1328 run_metadata)
1329 else:
1330 return self._do_call(_prun_fn, handle, feeds, fetches)

~\AppData\Local\Continuum\Miniconda3\envs\dl\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
1346 pass
1347 message = error_interpolation.interpolate(message, self._graph)
-> 1348 raise type(e)(node_def, op, message)
1349
1350 def _extend_graph(self):

InternalError: Dst tensor is not initialized.
[[{{node _arg_Placeholder_134_0_38/_161}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_527__arg_Placeholder_134_0_38", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

@madisonmay (Contributor)

Hi there,

I'm still at a loss as to what's going on in your environment, but it may be worth trying to run this in a Docker container through the NVIDIA runtime, or on a Linux machine if you have access to one -- it seems like there's a reasonable chance this issue is an artifact of running in a Windows environment.

The new estimator finetune version also requires tf==1.11.0 -- it's worth checking to make sure that this requirement is satisfied.
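
For example (the exact image tag and versions here are just one plausible combination, not a tested recipe):

# check the TensorFlow version in the current environment
python -c "import tensorflow as tf; print(tf.__version__)"  # should print 1.11.0

# one way to get a Linux environment with the NVIDIA runtime
docker run --runtime=nvidia -it tensorflow/tensorflow:1.11.0-gpu-py3 bash
pip install finetune --upgrade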

"Dst tensor is not initialized" is essentially the same as any other OOM error -- in my experience it's always been the same problem by a different name.

Sorry for the trouble, hope you're able to get something sorted out.

--Madison

@madisonmay
Copy link
Contributor

Going to close this down due to lack of activity. Feel free to re-open if you have more information we might be able to use to help diagnose the problem.
