InvalidArgumentError: LossTensor is inf or nan : Tensor had NaN values #1017

DuanWei-fudan · 2020-12-01T07:39:14Z

My code:

deeplabcut.train_network(config_path,shuffle=1,trainingsetindex=0,max_snapshots_to_keep=5,displayiters=10,saveiters=100, maxiters=10000)
Starting training....
---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
~\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
   1333     try:
-> 1334       return fn(*args)
   1335     except errors.OpError as e:

~\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\client\session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
   1318       return self._call_tf_sessionrun(
-> 1319           options, feed_dict, fetch_list, target_list, run_metadata)
   1320 

~\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\client\session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
   1406         self._session, options, feed_dict, fetch_list, target_list,
-> 1407         run_metadata)
   1408 

InvalidArgumentError: LossTensor is inf or nan : Tensor had NaN values
	 [[{{node train_op/CheckNumerics}}]]
	 [[{{node train_op/control_dependency}}]]

During handling of the above exception, another exception occurred:

InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-10-f3dfacdfe8e2> in <module>
----> 1 deeplabcut.train_network(config_path,shuffle=1,trainingsetindex=0,max_snapshots_to_keep=5,displayiters=10,saveiters=100, maxiters=10000)

~\anaconda3\envs\DLC-GPU\lib\site-packages\deeplabcut\pose_estimation_tensorflow\training.py in train_network(config, shuffle, trainingsetindex, max_snapshots_to_keep, displayiters, saveiters, maxiters, allow_growth, gputouse, autotune, keepdeconvweights)
    132         train(str(poseconfigfile),displayiters,saveiters,maxiters,max_to_keep=max_snapshots_to_keep,keepdeconvweights=keepdeconvweights,allow_growth=allow_growth) #pass on path and file name for pose_cfg.yaml!
    133     except BaseException as e:
--> 134         raise e
    135     finally:
    136         os.chdir(str(start_path))

~\anaconda3\envs\DLC-GPU\lib\site-packages\deeplabcut\pose_estimation_tensorflow\training.py in train_network(config, shuffle, trainingsetindex, max_snapshots_to_keep, displayiters, saveiters, maxiters, allow_growth, gputouse, autotune, keepdeconvweights)
    130             os.environ['CUDA_VISIBLE_DEVICES'] = str(gputouse)
    131     try:
--> 132         train(str(poseconfigfile),displayiters,saveiters,maxiters,max_to_keep=max_snapshots_to_keep,keepdeconvweights=keepdeconvweights,allow_growth=allow_growth) #pass on path and file name for pose_cfg.yaml!
    133     except BaseException as e:
    134         raise e

~\anaconda3\envs\DLC-GPU\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py in train(config_yaml, displayiters, saveiters, maxiters, max_to_keep, keepdeconvweights, allow_growth)
    188         current_lr = lr_gen.get_lr(it)
    189         [_, loss_val, summary] = sess.run([train_op, total_loss, merged_summaries],
--> 190                                           feed_dict={learning_rate: current_lr})
    191         cum_loss += loss_val
    192         train_writer.add_summary(summary, it)

~\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\client\session.py in run(self, fetches, feed_dict, options, run_metadata)
    927     try:
    928       result = self._run(None, fetches, feed_dict, options_ptr,
--> 929                          run_metadata_ptr)
    930       if run_metadata:
    931         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

~\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\client\session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1150     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1151       results = self._do_run(handle, final_targets, final_fetches,
-> 1152                              feed_dict_tensor, options, run_metadata)
   1153     else:
   1154       results = []

~\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\client\session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1326     if handle is None:
   1327       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1328                            run_metadata)
   1329     else:
   1330       return self._do_call(_prun_fn, handle, feeds, fetches)

~\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
   1346           pass
   1347       message = error_interpolation.interpolate(message, self._graph)
-> 1348       raise type(e)(node_def, op, message)
   1349 
   1350   def _extend_graph(self):

InvalidArgumentError: LossTensor is inf or nan : Tensor had NaN values
	 [[node train_op/CheckNumerics (defined at C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py:102) ]]
	 [[node train_op/control_dependency (defined at C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py:102) ]]

Caused by op 'train_op/CheckNumerics', defined at:
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\traitlets\config\application.py", line 845, in launch_instance
    app.start()
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\ipykernel\kernelapp.py", line 612, in start
    self.io_loop.start()
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tornado\platform\asyncio.py", line 149, in start
    self.asyncio_loop.run_forever()
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\asyncio\base_events.py", line 541, in run_forever
    self._run_once()
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\asyncio\base_events.py", line 1786, in _run_once
    handle._run()
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\asyncio\events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tornado\ioloop.py", line 690, in <lambda>
    lambda f: self._run_callback(functools.partial(callback, future))
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tornado\ioloop.py", line 743, in _run_callback
    ret = callback()
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tornado\gen.py", line 787, in inner
    self.run()
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tornado\gen.py", line 748, in run
    yielded = self.gen.send(value)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\ipykernel\kernelbase.py", line 365, in process_one
    yield gen.maybe_future(dispatch(*args))
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tornado\gen.py", line 209, in wrapper
    yielded = next(result)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\ipykernel\kernelbase.py", line 268, in dispatch_shell
    yield gen.maybe_future(handler(stream, idents, msg))
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tornado\gen.py", line 209, in wrapper
    yielded = next(result)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\ipykernel\kernelbase.py", line 545, in execute_request
    user_expressions, allow_stdin,
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tornado\gen.py", line 209, in wrapper
    yielded = next(result)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\ipykernel\ipkernel.py", line 306, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\ipykernel\zmqshell.py", line 536, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\IPython\core\interactiveshell.py", line 2878, in run_cell
    raw_cell, store_history, silent, shell_futures)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\IPython\core\interactiveshell.py", line 2923, in _run_cell
    return runner(coro)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\IPython\core\async_helpers.py", line 68, in _pseudo_sync_runner
    coro.send(None)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\IPython\core\interactiveshell.py", line 3147, in run_cell_async
    interactivity=interactivity, compiler=compiler, result=result)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\IPython\core\interactiveshell.py", line 3338, in run_ast_nodes
    if (await self.run_code(code, result,  async_=asy)):
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\IPython\core\interactiveshell.py", line 3418, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-10-f3dfacdfe8e2>", line 1, in <module>
    deeplabcut.train_network(config_path,shuffle=1,trainingsetindex=0,max_snapshots_to_keep=5,displayiters=10,saveiters=100, maxiters=10000)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\deeplabcut\pose_estimation_tensorflow\training.py", line 132, in train_network
    train(str(poseconfigfile),displayiters,saveiters,maxiters,max_to_keep=max_snapshots_to_keep,keepdeconvweights=keepdeconvweights,allow_growth=allow_growth) #pass on path and file name for pose_cfg.yaml!
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py", line 151, in train
    learning_rate, train_op = get_optimizer(total_loss, cfg)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py", line 102, in get_optimizer
    train_op = slim.learning.create_train_op(loss_op, optimizer)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py", line 439, in create_train_op
    check_numerics=check_numerics)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\contrib\training\python\training\training.py", line 464, in create_train_op
    'LossTensor is inf or nan')
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 919, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\framework\ops.py", line 3300, in create_op
    op_def=op_def)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\framework\ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): LossTensor is inf or nan : Tensor had NaN values
	 [[node train_op/CheckNumerics (defined at C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py:102) ]]
	 [[node train_op/control_dependency (defined at C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py:102) ]]

MMathisLab · 2020-12-05T17:31:56Z

Hi @DuanWei-fudan - what version of TensorFlow are you using? Please fill out the issue template completely so I can better help!

DuanWei-fudan · 2020-12-06T12:23:30Z

Hello, and thanks for helping. I am currently using tensorflow-gpu 1.13.1 with cudatoolkit 10.0.130 and cudnn 7.6.5. I tried tensorflow-gpu 2.0 first, but the code had some problems, so I switched. I know only a little about DeepLabCut, so I don't know how to solve this. Can I contact you on WeChat or another channel? There may be more problems in the future.
Some more information about my computer that may be useful:
Windows 10, 64-bit
AMD Ryzen Threadripper 3960X 24-Core Processor, 3.79 GHz
NVIDIA GeForce RTX 3090

Maybe TensorFlow 1.13.1 isn't the best fit for my computer?

MMathisLab · 2020-12-06T17:26:30Z

Hi @DuanWei-fudan -- easy fix! The RTX 3000 series does not work with TensorFlow 1.x; in short, you need to use TensorFlow 2, but this requires a new DLC update.

Here is the blog about this: http://www.mousemotorlab.org/deeplabcutblog/2020/11/23/rolling-up-to-tensorflow-2

What you can do is install this package:

#Install the branch with tf2.x support:
pip install git+https://github.com/DeepLabCut/DeepLabCut-core.git@tf2.2alpha
pip install tf_slim

and use the project you already made, just make a new training_dataset!

Here is a Colab notebook that shows what to do: https://colab.research.google.com/github/DeepLabCut/DeepLabCut-core/blob/tf2.2alpha/Colab_TrainNetwork_VideoAnalysis_TF2.ipynb

also see this issue: #944
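
For anyone checking whether their own setup falls into this case, here is a minimal sketch that compares a GPU name against the TensorFlow major version. It is illustrative only: `needs_tf2` is a hypothetical helper, not part of DeepLabCut or TensorFlow, and matching on "RTX 30x0" in the device name is an assumption based on the statement above that the 3000 series does not work with TensorFlow 1.x.

```python
import re


def needs_tf2(gpu_name: str, tf_version: str) -> bool:
    """Hypothetical helper: True if the GPU name looks like an RTX 30-series
    card and the given TensorFlow version string is 1.x."""
    # Assumption: 30-series cards report names such as "NVIDIA GeForce RTX 3090".
    is_30_series = re.search(r"RTX\s*30\d0", gpu_name, flags=re.IGNORECASE) is not None
    # The major version is the part before the first dot, e.g. "1" in "1.13.1".
    tf_major = int(tf_version.split(".")[0])
    return is_30_series and tf_major < 2


# Values taken from the report in this thread.
print(needs_tf2("NVIDIA GeForce RTX 3090", "1.13.1"))
```

In a real environment the GPU name could be read from `nvidia-smi --query-gpu=name --format=csv,noheader` and the version from `tensorflow.__version__`; both are passed in as plain strings here so the check runs without a GPU or TensorFlow installed.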

MMathisLab added the tensorflow/training label Dec 5, 2020

MMathisLab self-assigned this Dec 5, 2020

MMathisLab changed the title ~~when i learn the deeplabcut to train the network,a error has happened,what should i do~~ InvalidArgumentError: LossTensor is inf or nan : Tensor had NaN values Dec 5, 2020

MMathisLab closed this as completed Dec 6, 2020

MMathisLab mentioned this issue Dec 7, 2020: cannot import name 'GceClusterResolver' #1024 (Closed)

MMathisLab mentioned this issue Dec 22, 2020: No module named 'tensorflow.contrib #1042 (Closed)

jeylau mentioned this issue Jan 15, 2021: RTX 3070: LossTensor is inf or nan #1082 (Closed)

Ejdrup mentioned this issue Mar 13, 2021: All predictions place in top left-hand corner [ RTX 3*** does NOT work with TensorFlow 1.x! == odd errors! Please use deeplabcutcore ] #1142 (Closed)

MMathisLab mentioned this issue May 25, 2023: Graph Execution Error when Training Network #2248 (Closed)

luofangcheng mentioned this issue Jul 13, 2023: InvalidArgumentError: LossTensor is inf or nan : Tensor had NaN values Be-bo/VisualGaitLab#12 (Open)
