
./train.sh for TSM stops at the first log line: Freezing BatchNorm2D except... #1

Closed
Amazingren opened this issue Apr 10, 2020 · 6 comments

Comments

@Amazingren

Thanks for your nicely organized codebase.
However, when I try to train TSM with it, a problem stops the training:
(1) The log file stops at '2020-04-10 xxxx094-models.py#177: Freezing BatchNorm2D except the first one', and even after waiting 10 minutes there are no further updates.
(2) When I check GPU usage with 'gpustat', it shows only about 800 MB used on each GPU (I use 8 in total).

I am sorry for disturbing you; as a beginner, I would appreciate it if you could shed some light on this.
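For context, my understanding is that the log line refers to the usual TSN/TSM "partial BN" trick of freezing every BatchNorm2d except the first one; a rough sketch of that idea (only an illustration of the technique, not this repo's actual models.py code) would be:

```python
import torch.nn as nn

def freeze_partial_bn(model: nn.Module) -> None:
    """Freeze all BatchNorm2d layers except the first one (TSN/TSM-style partial BN).

    Sketch of the general technique only; X-Temporal's real implementation
    in models.py may differ.
    """
    bn_count = 0
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            bn_count += 1
            if bn_count >= 2:                      # keep the first BN trainable
                module.eval()                      # stop updating running statistics
                module.weight.requires_grad = False
                module.bias.requires_grad = False
```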

@deepcs233
Collaborator

Thanks for your feedback. You can set trainer.no_partial_bn = True if the batch size is >= 6 on each GPU, then retry; this will not affect accuracy. That module has a bug with distributed training, and we will fix it soon.

@Amazingren
Author

> Thanks for your feedback. You can set trainer.no_partial_bn = True if the batch size is >= 6 on each GPU, then retry; this will not affect accuracy. That module has a bug with distributed training, and we will fix it soon.

Thanks for the reply, but it doesn't solve my problem.
When I set no_partial_bn = True, the log file stops at 'save_dir: checkpoint/' with no further updates, and GPU usage is still only about 800~900 MB.

The only settings I changed in my YAML file are dataset-related:
root_dir:
train:
  meta_file: /home/renb/project/action_recognition/X-Temporal/data_labels/sthv1/train_videofolder.txt
val:
  meta_file: /home/renb/project/action_recognition/X-Temporal/data_labels/sthv1/val_videofolder.txt
test:
  meta_file: /home/renb/project/action_recognition/X-Temporal/data_labels/sthv1/test_videofolder.txt
I am very confused about this.

Thanks again; I look forward to your suggestion.

@deepcs233
Collaborator

I tested it in my environment and it works. Could you provide your training script? Also, the GPU count in the default yaml is 8, so you may need to check that it matches your setup. Finally, the program needs to be able to build the correct image path from your img_prefix, root_dir and meta_file. I noticed that you only modified root_dir and meta_file; maybe img_prefix also needs to be modified.
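Roughly speaking, the loader joins those three fields to locate each frame, something like the following (a simplified illustration with placeholder values, not the exact dataset code):

```python
import os

# Simplified illustration of how a frame path is typically assembled.
root_dir = '/path/to/sthv1/frames'        # root_dir from the yaml
video_folder = '100218'                   # folder name read from a line of meta_file
img_prefix = 'image_{:05d}.jpg'           # must match the file names on disk

frame_path = os.path.join(root_dir, video_folder, img_prefix.format(1))
print(frame_path)  # /path/to/sthv1/frames/100218/image_00001.jpg
```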

@Amazingren
Author

Amazingren commented Apr 10, 2020

> I tested it in my environment and it works. Could you provide your training script? Also, the GPU count in the default yaml is 8, so you may need to check that it matches your setup. Finally, the program needs to be able to build the correct image path from your img_prefix, root_dir and meta_file. I noticed that you only modified root_dir and meta_file; maybe img_prefix also needs to be modified.

Thanks, it works for me after I modified 'img_prefix' from 'image_{:05d}.jpg' to '{:05d}.jpg'.
Also, maybe num_class should be 174 instead of 102 in your default yaml file.

However, another problem appears when I run ./train.sh; the error message is as follows:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of `forward`). You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /opt/conda/conda-bld/pytorch_1556653215914/work/torch/csrc/distributed/c10d/reducer.cpp:408)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f2d1a4d5dc5 in /home/renb/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x5ff (0x7f2d4084bbbf in /home/renb/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #2: <unknown function> + 0x6cb6c8 (0x7f2d408416c8 in /home/renb/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x12d07a (0x7f2d402a307a in /home/renb/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: _PyMethodDef_RawFastCallKeywords + 0x264 (0x5623adb79c34 in /home/renb/anaconda3/bin/python)
frame #5: _PyCFunction_FastCallKeywords + 0x21 (0x5623adb79d51 in /home/renb/anaconda3/bin/python)
frame #10: PyObject_Call + 0x6e (0x5623adb3ba3e in /home/renb/anaconda3/bin/python)
frame #40: PyRun_SimpleStringFlags + 0x3f (0x5623adc4bd6f in /home/renb/anaconda3/bin/python)
frame #41: <unknown function> + 0x235e6d (0x5623adc4be6d in /home/renb/anaconda3/bin/python)
frame #42: _Py_UnixMain + 0x3c (0x5623adc4c1ec in /home/renb/anaconda3/bin/python)
frame #43: __libc_start_main + 0xf0 (0x7f2d4e9f7830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #44: <unknown function> + 0x1daf7d (0x5623adbf0f7d in /home/renb/anaconda3/bin/python)

I have no idea about this either. I found a similar issue here:
ultralytics/yolov3#404
but I am not sure how to apply the fix in X-Temporal.
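If I read the message correctly, the suggested workaround is to pass find_unused_parameters=True where the model is wrapped in DistributedDataParallel; a minimal sketch of that pattern (the tiny model and rank below are placeholders, not X-Temporal's actual trainer code) would be:

```python
import torch
import torch.nn as nn

# Minimal sketch of the DDP workaround the error message describes.
# Assumes torch.distributed.init_process_group(...) was already called
# by the launch script; the model and local_rank here are placeholders.
local_rank = 0
model = nn.Linear(16, 16).cuda(local_rank)
model = nn.parallel.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    find_unused_parameters=True,  # let DDP skip parameters unused in forward()
)
```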
Sorry again for my endless questions, and thanks a lot.

@deepcs233
Collaborator

The yaml is only an example; num_classes is not set specifically for Something-Something. This error comes from no_partial_bn. Did you set no_partial_bn = True? Alternatively, you can git pull; I just updated the code.

@Amazingren
Author

> The yaml is only an example; num_classes is not set specifically for Something-Something. This error comes from no_partial_bn. Did you set no_partial_bn = True? Alternatively, you can git pull; I just updated the code.

Cool! It works when I set no_partial_bn = True! (I had turned it back to False while trying to fix the other problems.)
Thanks a lot for your patience with my problems! Good night!
