ValueError: The input size of each GPU must be 1 in testing mode #99

Closed
zhang123-sys opened this issue Dec 17, 2022 · 5 comments

Comments

@zhang123-sys

After training GaitGL for 80000 epochs, I tested it by running "CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --master_port 12345 --nproc_per_node=1 opengait/main.py --cfgs ./config/gaitgl/gaitgl.yaml --phase test", but got "ValueError: The input size of each GPU must be 1 in testing mode, but got 4!"

@zhang123-sys
Author

zhang123-sys commented Dec 17, 2022

Traceback (most recent call last):
  File "opengait/main.py", line 71, in <module>
    run_model(cfgs, training)
  File "opengait/main.py", line 56, in run_model
    Model.run_test(model)
  File "/hy-nas/OpenGait/opengait/modeling/base_model.py", line 439, in run_test
    info_dict = model.inference(rank)
  File "/hy-nas/OpenGait/opengait/modeling/base_model.py", line 378, in inference
    retval = self.forward(ipts)
  File "/hy-nas/OpenGait/opengait/modeling/models/gaitgl.py", line 157, in forward
    'The input size of each GPU must be 1 in testing mode, but got {}!'.format(len(labs)))
ValueError: The input size of each GPU must be 1 in testing mode, but got 4!
Transforming:   0%|                                                                                                                                      | 0/5485 [00:00<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6432) of binary: /usr/local/miniconda3/envs/OpenGait/bin/python
Traceback (most recent call last):
  File "/usr/local/miniconda3/envs/OpenGait/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/miniconda3/envs/OpenGait/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/miniconda3/envs/OpenGait/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/usr/local/miniconda3/envs/OpenGait/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/usr/local/miniconda3/envs/OpenGait/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/usr/local/miniconda3/envs/OpenGait/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/usr/local/miniconda3/envs/OpenGait/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/miniconda3/envs/OpenGait/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
opengait/main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-12-17_11:03:35
  host      : Idddbe339d00c011ff
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 6432)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

@ChaoFan996
Collaborator

Hi, set batch size to 1 here, please.
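
For readers without the linked line, here is a minimal sketch of the change, assuming the evaluation batch size lives under `evaluator_cfg` -> `sampler` -> `batch_size` in `./config/gaitgl/gaitgl.yaml`. Those key names and the helper `with_eval_batch_size` are assumptions for illustration; the starting value of 4 is taken from the "but got 4" in the error message.

```python
import copy

# Hypothetical excerpt of the evaluation settings as a nested dict.
# The key names are assumptions, not confirmed by this thread.
cfg = {"evaluator_cfg": {"sampler": {"batch_size": 4}}}


def with_eval_batch_size(cfg, batch_size):
    """Return a copy of cfg with the evaluation batch size replaced."""
    new_cfg = copy.deepcopy(cfg)
    new_cfg["evaluator_cfg"]["sampler"]["batch_size"] = batch_size
    return new_cfg


# Single-GPU test run: one sequence per batch.
fixed_cfg = with_eval_batch_size(cfg, 1)
```

In practice the same edit would be made directly in the YAML file rather than in code.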

@ChaoFan996
Collaborator

In OpenGait, the batch size during testing should equal the number of GPUs in use for models that compress the temporal dimension, e.g., GaitGL.
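
A rough sketch of how this rule relates to the error in the traceback. It assumes the test batch is split evenly across GPUs; `per_gpu_batch_size` and `check_test_input` are hypothetical helpers, and only the error message is copied from the traceback.

```python
def per_gpu_batch_size(total_batch_size, num_gpus):
    """Samples each GPU receives, assuming an even split across GPUs."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if total_batch_size % num_gpus != 0:
        raise ValueError("batch size must be divisible by the number of GPUs")
    return total_batch_size // num_gpus


def check_test_input(labels, training):
    """Reject more than one sequence per GPU outside of training."""
    if not training and len(labels) != 1:
        raise ValueError(
            'The input size of each GPU must be 1 in testing mode, but got {}!'.format(len(labels)))


# Batch size 4 on one GPU, as in the report: the check fails.
n = per_gpu_batch_size(4, 1)
try:
    check_test_input(list(range(n)), training=False)
    message = None
except ValueError as err:
    message = str(err)

# Batch size equal to the number of GPUs: one sequence each, the check passes.
check_test_input(list(range(per_gpu_batch_size(1, 1))), training=False)
```

Under this assumption, a run with `--nproc_per_node=1` needs a test batch size of 1, and a run on 8 GPUs would need 8.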

@ChaoFan996
Collaborator

The same applies to the training phase when unfixed-length sequences are used as input.
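
One possible reading of why unfixed-length input matters, shown as a toy example. It assumes that sequences of different lengths are packed back to back along the frame axis, and that temporal compression behaves like a sliding window with kernel 3 and stride 3. Both are assumptions for illustration, not confirmed details of OpenGait or GaitGL, and `temporal_windows` is a hypothetical helper.

```python
def temporal_windows(frames, kernel=3, stride=3):
    """Group frames into the windows a temporal pooling step would reduce."""
    return [frames[i:i + kernel]
            for i in range(0, len(frames) - kernel + 1, stride)]


# Two sequences of different lengths, labelled by sequence (a or b).
seq_a = ["a0", "a1", "a2", "a3"]
seq_b = ["b0", "b1", "b2", "b3", "b4"]

# Assumed packing: frames concatenated along the temporal axis.
packed = seq_a + seq_b
windows = temporal_windows(packed)

# Windows that contain frames from more than one sequence.
mixed = [w for w in windows if len({frame[0] for frame in w}) > 1]
```

Here the second window is `["a3", "b0", "b1"]`, so a pooled feature would blend two different sequences. With one sequence per GPU no window can cross a sequence boundary.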

@ChaoFan996 changed the title from "test GaitGL" to "ValueError: The input size of each GPU must be 1 in testing mode" on Dec 17, 2022
@zhang123-sys
Author

OK.
