
./train.sh for TSM stops at the first log line: Freezing BatchNorm2D except... #1

Closed
Amazingren opened this issue Apr 10, 2020 · 6 comments

Comments

@Amazingren

Thanks for your nicely organized codebase.
However, when I try to train TSM with it, a problem stops the training:
(1) The log file stops at '2020-04-10 xxxx094-models.py#177: Freezing BatchNorm2D except the first one', and even after waiting 10 minutes there are no further updates.
(2) When I check GPU usage with 'gpustat', it shows only about 800 MB used on each GPU (I use 8 in total).

I am sorry for disturbing you; as a beginner, I would appreciate it if you could shed some light on this.
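For context, my understanding is that the log line refers to the usual TSN/TSM "partial BN" trick of freezing every BatchNorm2d except the first one; a rough sketch of that idea (only an illustration of the technique, not this repo's actual models.py code) would be:

```python
import torch.nn as nn

def freeze_partial_bn(model: nn.Module) -> None:
    """Freeze all BatchNorm2d layers except the first one (TSN/TSM-style partial BN).

    Sketch of the general technique only; X-Temporal's real implementation
    in models.py may differ.
    """
    bn_count = 0
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            bn_count += 1
            if bn_count >= 2:                      # keep the first BN trainable
                module.eval()                      # stop updating running statistics
                module.weight.requires_grad = False
                module.bias.requires_grad = False
```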

@deepcs233
Collaborator

Thanks for your feedback. You can set trainer.no_partial_bn = True if the batch size is >= 6 on each GPU, then retry; this will not affect accuracy. That module has a bug with distributed training, and we will fix it soon.

@Amazingren
Author

> Thanks for your feedback. You can set trainer.no_partial_bn = True if the batch size is >= 6 on each GPU, then retry; this will not affect accuracy. That module has a bug with distributed training, and we will fix it soon.

Thanks for the reply, but it doesn't solve my problem.
When I set no_partial_bn = True, the log file stops at 'save_dir: checkpoint/' with no further updates, and GPU usage is still only about 800~900 MB.

The only settings I changed in my YAML file are dataset-related:
root_dir:
train:
  meta_file: /home/renb/project/action_recognition/X-Temporal/data_labels/sthv1/train_videofolder.txt
val:
  meta_file: /home/renb/project/action_recognition/X-Temporal/data_labels/sthv1/val_videofolder.txt
test:
  meta_file: /home/renb/project/action_recognition/X-Temporal/data_labels/sthv1/test_videofolder.txt
I am very confused about this.

Thanks again; I look forward to your suggestion.

@deepcs233
Collaborator

I tested it in my environment and it works. Could you provide your training script? Also, the GPU count in the default yaml is 8, so you may need to check that it matches your setup. Finally, the program needs to be able to build the correct image path from your img_prefix, root_dir and meta_file. I noticed that you only modified root_dir and meta_file; maybe img_prefix also needs to be modified.
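Roughly speaking, the loader joins those three fields to locate each frame, something like the following (a simplified illustration with placeholder values, not the exact dataset code):

```python
import os

# Simplified illustration of how a frame path is typically assembled.
root_dir = '/path/to/sthv1/frames'        # root_dir from the yaml
video_folder = '100218'                   # folder name read from a line of meta_file
img_prefix = 'image_{:05d}.jpg'           # must match the file names on disk

frame_path = os.path.join(root_dir, video_folder, img_prefix.format(1))
print(frame_path)  # /path/to/sthv1/frames/100218/image_00001.jpg
```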

@Amazingren
Author

Amazingren commented Apr 10, 2020

> I tested it in my environment and it works. Could you provide your training script? Also, the GPU count in the default yaml is 8, so you may need to check that it matches your setup. Finally, the program needs to be able to build the correct image path from your img_prefix, root_dir and meta_file. I noticed that you only modified root_dir and meta_file; maybe img_prefix also needs to be modified.

Thanks, it works for me after I modified 'img_prefix' from 'image_{:05d}.jpg' to '{:05d}.jpg'.
Also, maybe num_class should be 174 instead of 102 in your default yaml file.

However, another problem appears when I run ./train.sh; the error message is as follows:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of `forward`). You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /opt/conda/conda-bld/pytorch_1556653215914/work/torch/csrc/distributed/c10d/reducer.cpp:408)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f2d1a4d5dc5 in /home/renb/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x5ff (0x7f2d4084bbbf in /home/renb/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #2: <unknown function> + 0x6cb6c8 (0x7f2d408416c8 in /home/renb/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x12d07a (0x7f2d402a307a in /home/renb/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: _PyMethodDef_RawFastCallKeywords + 0x264 (0x5623adb79c34 in /home/renb/anaconda3/bin/python)
frame #5: _PyCFunction_FastCallKeywords + 0x21 (0x5623adb79d51 in /home/renb/anaconda3/bin/python)
frame #10: PyObject_Call + 0x6e (0x5623adb3ba3e in /home/renb/anaconda3/bin/python)
frame #40: PyRun_SimpleStringFlags + 0x3f (0x5623adc4bd6f in /home/renb/anaconda3/bin/python)
frame #41: <unknown function> + 0x235e6d (0x5623adc4be6d in /home/renb/anaconda3/bin/python)
frame #42: _Py_UnixMain + 0x3c (0x5623adc4c1ec in /home/renb/anaconda3/bin/python)
frame #43: __libc_start_main + 0xf0 (0x7f2d4e9f7830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #44: <unknown function> + 0x1daf7d (0x5623adbf0f7d in /home/renb/anaconda3/bin/python)

I have no idea about this either. I found a similar issue here:
ultralytics/yolov3#404
but I am not sure how to apply the fix in X-Temporal.
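If I read the message correctly, the suggested workaround is to pass find_unused_parameters=True where the model is wrapped in DistributedDataParallel; a minimal sketch of that pattern (the tiny model and rank below are placeholders, not X-Temporal's actual trainer code) would be:

```python
import torch
import torch.nn as nn

# Minimal sketch of the DDP workaround the error message describes.
# Assumes torch.distributed.init_process_group(...) was already called
# by the launch script; the model and local_rank here are placeholders.
local_rank = 0
model = nn.Linear(16, 16).cuda(local_rank)
model = nn.parallel.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    find_unused_parameters=True,  # let DDP skip parameters unused in forward()
)
```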
Sorry again for my endless questions, and thanks a lot.

@deepcs233
Collaborator

The yaml is only an example; num_classes is not set specifically for Something-Something. This error comes from no_partial_bn. Did you set no_partial_bn = True? Alternatively, you can git pull; I just updated the code.

@Amazingren
Author

> The yaml is only an example; num_classes is not set specifically for Something-Something. This error comes from no_partial_bn. Did you set no_partial_bn = True? Alternatively, you can git pull; I just updated the code.

Cool! It works when I set no_partial_bn = True! (I had turned it back to False while trying to fix the other problems.)
Thanks a lot for your patience with my problems! Good night!
