InvalidArgumentError: LossTensor is inf or nan : Tensor had NaN values #1017

Closed
DuanWei-fudan opened this issue Dec 1, 2020 · 3 comments

DuanWei-fudan commented Dec 1, 2020

My code:

deeplabcut.train_network(config_path,shuffle=1,trainingsetindex=0,max_snapshots_to_keep=5,displayiters=10,saveiters=100, maxiters=10000)
Starting training....
---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
~\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
   1333     try:
-> 1334       return fn(*args)
   1335     except errors.OpError as e:

~\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\client\session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
   1318       return self._call_tf_sessionrun(
-> 1319           options, feed_dict, fetch_list, target_list, run_metadata)
   1320 

~\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\client\session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
   1406         self._session, options, feed_dict, fetch_list, target_list,
-> 1407         run_metadata)
   1408 

InvalidArgumentError: LossTensor is inf or nan : Tensor had NaN values
	 [[{{node train_op/CheckNumerics}}]]
	 [[{{node train_op/control_dependency}}]]

During handling of the above exception, another exception occurred:

InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-10-f3dfacdfe8e2> in <module>
----> 1 deeplabcut.train_network(config_path,shuffle=1,trainingsetindex=0,max_snapshots_to_keep=5,displayiters=10,saveiters=100, maxiters=10000)

~\anaconda3\envs\DLC-GPU\lib\site-packages\deeplabcut\pose_estimation_tensorflow\training.py in train_network(config, shuffle, trainingsetindex, max_snapshots_to_keep, displayiters, saveiters, maxiters, allow_growth, gputouse, autotune, keepdeconvweights)
    132         train(str(poseconfigfile),displayiters,saveiters,maxiters,max_to_keep=max_snapshots_to_keep,keepdeconvweights=keepdeconvweights,allow_growth=allow_growth) #pass on path and file name for pose_cfg.yaml!
    133     except BaseException as e:
--> 134         raise e
    135     finally:
    136         os.chdir(str(start_path))

~\anaconda3\envs\DLC-GPU\lib\site-packages\deeplabcut\pose_estimation_tensorflow\training.py in train_network(config, shuffle, trainingsetindex, max_snapshots_to_keep, displayiters, saveiters, maxiters, allow_growth, gputouse, autotune, keepdeconvweights)
    130             os.environ['CUDA_VISIBLE_DEVICES'] = str(gputouse)
    131     try:
--> 132         train(str(poseconfigfile),displayiters,saveiters,maxiters,max_to_keep=max_snapshots_to_keep,keepdeconvweights=keepdeconvweights,allow_growth=allow_growth) #pass on path and file name for pose_cfg.yaml!
    133     except BaseException as e:
    134         raise e

~\anaconda3\envs\DLC-GPU\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py in train(config_yaml, displayiters, saveiters, maxiters, max_to_keep, keepdeconvweights, allow_growth)
    188         current_lr = lr_gen.get_lr(it)
    189         [_, loss_val, summary] = sess.run([train_op, total_loss, merged_summaries],
--> 190                                           feed_dict={learning_rate: current_lr})
    191         cum_loss += loss_val
    192         train_writer.add_summary(summary, it)

~\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\client\session.py in run(self, fetches, feed_dict, options, run_metadata)
    927     try:
    928       result = self._run(None, fetches, feed_dict, options_ptr,
--> 929                          run_metadata_ptr)
    930       if run_metadata:
    931         proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

~\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\client\session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
   1150     if final_fetches or final_targets or (handle and feed_dict_tensor):
   1151       results = self._do_run(handle, final_targets, final_fetches,
-> 1152                              feed_dict_tensor, options, run_metadata)
   1153     else:
   1154       results = []

~\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\client\session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
   1326     if handle is None:
   1327       return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1328                            run_metadata)
   1329     else:
   1330       return self._do_call(_prun_fn, handle, feeds, fetches)

~\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\client\session.py in _do_call(self, fn, *args)
   1346           pass
   1347       message = error_interpolation.interpolate(message, self._graph)
-> 1348       raise type(e)(node_def, op, message)
   1349 
   1350   def _extend_graph(self):

InvalidArgumentError: LossTensor is inf or nan : Tensor had NaN values
	 [[node train_op/CheckNumerics (defined at C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py:102) ]]
	 [[node train_op/control_dependency (defined at C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py:102) ]]

Caused by op 'train_op/CheckNumerics', defined at:
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\traitlets\config\application.py", line 845, in launch_instance
    app.start()
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\ipykernel\kernelapp.py", line 612, in start
    self.io_loop.start()
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tornado\platform\asyncio.py", line 149, in start
    self.asyncio_loop.run_forever()
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\asyncio\base_events.py", line 541, in run_forever
    self._run_once()
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\asyncio\base_events.py", line 1786, in _run_once
    handle._run()
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\asyncio\events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tornado\ioloop.py", line 690, in <lambda>
    lambda f: self._run_callback(functools.partial(callback, future))
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tornado\ioloop.py", line 743, in _run_callback
    ret = callback()
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tornado\gen.py", line 787, in inner
    self.run()
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tornado\gen.py", line 748, in run
    yielded = self.gen.send(value)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\ipykernel\kernelbase.py", line 365, in process_one
    yield gen.maybe_future(dispatch(*args))
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tornado\gen.py", line 209, in wrapper
    yielded = next(result)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\ipykernel\kernelbase.py", line 268, in dispatch_shell
    yield gen.maybe_future(handler(stream, idents, msg))
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tornado\gen.py", line 209, in wrapper
    yielded = next(result)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\ipykernel\kernelbase.py", line 545, in execute_request
    user_expressions, allow_stdin,
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tornado\gen.py", line 209, in wrapper
    yielded = next(result)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\ipykernel\ipkernel.py", line 306, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\ipykernel\zmqshell.py", line 536, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\IPython\core\interactiveshell.py", line 2878, in run_cell
    raw_cell, store_history, silent, shell_futures)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\IPython\core\interactiveshell.py", line 2923, in _run_cell
    return runner(coro)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\IPython\core\async_helpers.py", line 68, in _pseudo_sync_runner
    coro.send(None)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\IPython\core\interactiveshell.py", line 3147, in run_cell_async
    interactivity=interactivity, compiler=compiler, result=result)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\IPython\core\interactiveshell.py", line 3338, in run_ast_nodes
    if (await self.run_code(code, result,  async_=asy)):
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\IPython\core\interactiveshell.py", line 3418, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-10-f3dfacdfe8e2>", line 1, in <module>
    deeplabcut.train_network(config_path,shuffle=1,trainingsetindex=0,max_snapshots_to_keep=5,displayiters=10,saveiters=100, maxiters=10000)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\deeplabcut\pose_estimation_tensorflow\training.py", line 132, in train_network
    train(str(poseconfigfile),displayiters,saveiters,maxiters,max_to_keep=max_snapshots_to_keep,keepdeconvweights=keepdeconvweights,allow_growth=allow_growth) #pass on path and file name for pose_cfg.yaml!
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py", line 151, in train
    learning_rate, train_op = get_optimizer(total_loss, cfg)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py", line 102, in get_optimizer
    train_op = slim.learning.create_train_op(loss_op, optimizer)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py", line 439, in create_train_op
    check_numerics=check_numerics)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\contrib\training\python\training\training.py", line 464, in create_train_op
    'LossTensor is inf or nan')
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 919, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\util\deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\framework\ops.py", line 3300, in create_op
    op_def=op_def)
  File "C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\tensorflow\python\framework\ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): LossTensor is inf or nan : Tensor had NaN values
	 [[node train_op/CheckNumerics (defined at C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py:102) ]]
	 [[node train_op/control_dependency (defined at C:\Users\Administrator\anaconda3\envs\DLC-GPU\lib\site-packages\deeplabcut\pose_estimation_tensorflow\train.py:102) ]]

MMathisLab (Member) commented

Hi @DuanWei-fudan, what version of TensorFlow are you using? Please fill out the issue template completely so I can better help!
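
One way to collect the version information asked for here is a short script run inside the same conda environment. This is a minimal sketch, not part of DeepLabCut: the helper name `report_versions` and the list of package names are illustrative assumptions, and on Python 3.7 it assumes the `importlib_metadata` backport is installed.

```python
try:
    from importlib import metadata  # standard library on Python 3.8+
except ImportError:
    # Backport for Python 3.7 (assumption: it is installed in the environment).
    import importlib_metadata as metadata


def report_versions(packages):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package names are assumptions; adjust them to what the environment uses.
    wanted = ["deeplabcut", "tensorflow", "tensorflow-gpu"]
    for name, version in report_versions(wanted).items():
        print(f"{name}: {version or 'not installed'}")
```

Pasting the printed lines into the issue template gives the maintainers the package versions in one place.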

MMathisLab self-assigned this Dec 5, 2020
MMathisLab changed the title from "when i learn the deeplabcut to train the network,a error has happened,what should i do" to "InvalidArgumentError: LossTensor is inf or nan : Tensor had NaN values" Dec 5, 2020

DuanWei-fudan (Author) commented

Hello, and thanks for the help. I am currently using tensorflow-gpu 1.13.1 with cudatoolkit 10.0.130 and cudnn 7.6.5. I tried tensorflow-gpu 2.0 first, but the code had some problems, so I switched. I only know a little about DeepLabCut, so I don't know how to solve this. Can I contact you on WeChat or some other channel? There may be more problems in the future.
Some more information about my computer may be useful:
Windows 10, 64-bit
AMD Ryzen Threadripper 3960X 24-Core Processor, 3.79 GHz
NVIDIA GeForce RTX 3090

Maybe TensorFlow 1.13.1 isn't the best fit for my computer?

MMathisLab (Member) commented Dec 6, 2020

Hi @DuanWei-fudan, easy fix! The RTX 3000 series does not work with TensorFlow 1.x, so in short you need to use TensorFlow 2, but this requires a new DLC update.
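
The rule stated above can be written down as a small pre-flight check. This is only a sketch that encodes the claim made in this thread (RTX 30-series cards need TensorFlow 2.x); the function name `has_known_conflict` and the name-matching pattern are hypothetical, and the check says nothing about other GPUs or about CUDA/cuDNN versions.

```python
import re

# Assumption taken from this thread only: RTX 30-series cards need TensorFlow 2.x.
# The pattern matches names such as "RTX 3090" or "RTX 3060 Ti".
_RTX_30_SERIES = re.compile(r"\bRTX\s*30\d0\b", re.IGNORECASE)


def has_known_conflict(gpu_name: str, tf_version: str) -> bool:
    """Return True for an RTX 30-series GPU paired with TensorFlow 1.x."""
    tf_major = int(tf_version.split(".")[0])
    is_rtx_30 = _RTX_30_SERIES.search(gpu_name) is not None
    return is_rtx_30 and tf_major < 2


if __name__ == "__main__":
    # Values reported earlier in this issue.
    print(has_known_conflict("NVIDIA GeForce RTX 3090", "1.13.1"))
```

With the hardware and TensorFlow version reported in this issue, the check flags the combination as conflicting.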

Here is the blog about this: http://www.mousemotorlab.org/deeplabcutblog/2020/11/23/rolling-up-to-tensorflow-2

What you can do is install this package:

# Install the branch with TF 2.x support:
pip install git+https://github.com/DeepLabCut/DeepLabCut-core.git@tf2.2alpha
pip install tf_slim

Then use the project you already made; just create a new training dataset!

Here is a Colab notebook that shows you what to do: https://colab.research.google.com/github/DeepLabCut/DeepLabCut-core/blob/tf2.2alpha/Colab_TrainNetwork_VideoAnalysis_TF2.ipynb
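
The two steps described above (create a new training dataset in the existing project, then train again) could look roughly like this. It is a sketch under assumptions, not the maintainers' recipe: the wrapper name `retrain_with_new_dataset` is hypothetical, the module is passed in as a parameter so no import name is hard-coded, and the `train_network` arguments are copied from the call in the original report. The import name `deeplabcutcore` in the usage comment is an assumption about the tf2.2alpha branch.

```python
def retrain_with_new_dataset(dlc, config_path, maxiters=10000):
    """Create a fresh training dataset, then train on it.

    `dlc` is the imported DeepLabCut module, passed in so this sketch
    does not depend on a particular import name.
    """
    # Regenerate the training dataset inside the existing project.
    dlc.create_training_dataset(config_path)
    # Same training call as in the original report.
    dlc.train_network(
        config_path,
        shuffle=1,
        trainingsetindex=0,
        max_snapshots_to_keep=5,
        displayiters=10,
        saveiters=100,
        maxiters=maxiters,
    )


# Usage (assumption: the tf2.2alpha branch is importable as `deeplabcutcore`):
# import deeplabcutcore as dlc
# retrain_with_new_dataset(dlc, config_path)
```

Passing the module in also makes it easy to switch between the regular package and the TF 2 branch without editing the function.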

Also see this issue: #944
