Training hangs at the end of the first epoch in the image classification task. #33
Dear author:

When training the UniFormer model with 8 GPUs, I start the code with the following run.sh:

[run.sh contents not preserved]

And the logs are (I have deleted the display of the model details):

[logs not preserved]

The displayed logs are very messy and maddening, and I have no clue whether the code is running correctly. These warnings only occur in distributed training. Have you ever met this situation? I would appreciate it if you could give me some advice.
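(Interleaved output from eight processes is typical when every rank prints its own warnings. DeiT-style codebases, which UniFormer's image classification code follows, usually silence printing on non-master ranks with something like the sketch below; whether it was active in this run is not shown in the thread.)

```python
import builtins
import torch.distributed as dist

def setup_for_distributed(is_master: bool):
    """Disable printing on non-master ranks so logs are not interleaved."""
    builtin_print = builtins.print

    def wrapped_print(*args, **kwargs):
        force = kwargs.pop("force", False)
        if is_master or force:
            builtin_print(*args, **kwargs)

    builtins.print = wrapped_print

# After dist.init_process_group(...), e.g.:
# setup_for_distributed(dist.get_rank() == 0)
```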
I also met these warnings, but they do not affect the performance. Just ignore them~

You can simply check the …
Thanks for your kind response. However, the training really hangs :(
It seems that you should increase the shared memory?
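(To check whether shared memory is actually the bottleneck during training, a minimal sketch using only the standard library; the /dev/shm path assumes a typical Linux setup:)

```python
import os

def shm_usage_gib(path: str = "/dev/shm"):
    """Return (total, used, free) of the shared-memory filesystem in GiB."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize
    gib = 1024 ** 3
    return total / gib, (total - free) / gib, free / gib

total, used, free = shm_usage_gib()
print(f"/dev/shm: {total:.1f} GiB total, {used:.1f} GiB used, {free:.1f} GiB free")
```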
See here. You can try to use a smaller batch size if you train UniFormer-B. Note that the learning rate should be adjusted accordingly.
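(On adjusting the learning rate: DeiT-style training scales the base LR linearly with the total batch size, lr = base_lr × total_batch / 512 with base_lr = 5e-4; whether UniFormer uses exactly these constants is an assumption here. A minimal sketch:)

```python
def scaled_lr(batch_per_gpu: int, num_gpus: int = 8,
              base_lr: float = 5e-4, base_batch: int = 512) -> float:
    """Linear LR scaling: keep lr / total_batch_size constant."""
    return base_lr * (batch_per_gpu * num_gpus) / base_batch

print(scaled_lr(batch_per_gpu=64))   # 8 GPUs x 64 = 512 images -> 5.0e-4
print(scaled_lr(batch_per_gpu=32))   # 8 GPUs x 32 = 256 images -> 2.5e-4
```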
First of all, thanks for your advice. However, shared memory might not be the reason: my server has more than 30 GB of shared memory, and I have watched its usage during training; less than 500 MB is used throughout the whole run. I have also tried many possible solutions from the Internet, but none of them worked. The training always gets stuck at the end of the first epoch, before or after the end of the last batch, under PyTorch 1.7.0:
or
I think something might go wrong when the processes synchronize, but I have no idea how to fix it. (See UniFormer/image_classification/utils.py, lines 35 to 47 at e802470.)
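(The utils.py lines referenced above are not reproduced in the thread. In DeiT-style codebases the metric synchronization looks roughly like the sketch below; the function name and shape here are an approximation. The relevant property is that every rank must reach the barrier/all_reduce the same number of times, otherwise the job blocks forever, which is consistent with a hang at the epoch boundary.)

```python
import torch
import torch.distributed as dist

def synchronize_between_processes(count: int, total: float):
    """Sum a meter's (count, total) across all ranks. If one rank skips this
    call (e.g. it saw fewer batches), the others wait here indefinitely."""
    if not (dist.is_available() and dist.is_initialized()):
        return count, total
    t = torch.tensor([count, total], dtype=torch.float64, device="cuda")
    dist.barrier()        # every rank must arrive here
    dist.all_reduce(t)    # element-wise sum across ranks
    count, total = t.tolist()
    return int(count), total
```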
About the messy warnings: they might be due to the PyTorch version. I reinstalled my virtual environment with PyTorch 1.7.0, and all the warnings and errors disappeared. However, instead of being interrupted, the training now always gets stuck at the end of the first epoch.
There is another strange case worth noticing: when I use a smaller dataset (about 1000 images) to debug, everything goes well. … at line 33 (UniFormer/image_classification/engine.py, lines 30 to 33 at e802470) everything also goes well. I really don't know what's wrong with my training.
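(One hypothesis consistent with 'a small debug dataset works, the full one hangs' is that the ranks iterate different numbers of batches, so one rank leaves the epoch loop while the others wait in a collective op. A hypothetical diagnostic, not from the repo:)

```python
import torch.distributed as dist

def report_rank_batches(data_loader) -> None:
    """Print how many batches this rank will see; the counts must match
    across ranks, or end-of-epoch collectives can deadlock."""
    rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
    # If non-master printing is silenced (see setup_for_distributed above),
    # write to a per-rank file instead of relying on stdout.
    print(f"rank {rank}: {len(data_loader)} batches")

# A common mitigation (an assumption, not necessarily this issue's fix) is to
# create the DataLoader with drop_last=True so every rank gets equal counts.
```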
It looks so strange; I have never met this bug...
Maybe you can check the CUDA, cuDNN, PyTorch, and Torchvision versions. I suggest you try DeiT first and check the log~~
It's really strange; I have never met a similar problem before. I have been debugging the code for several days but still cannot get the network to run.
I have tried PyTorch 1.7 and PyTorch 1.9 with CUDA 11. Considering that DeiT is an earlier work, might PyTorch 1.9 or CUDA 11 not be a good choice for it? Do you remember your versions of CUDA and cuDNN?
Come on! Try DeiT and see if you meet the same bugs.
In my environment, torch 1.6-1.10 is okay for DeiT.
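(For checking the versions mentioned above in one go, a small snippet:)

```python
import torch
import torchvision

print("torch      :", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA       :", torch.version.cuda)
print("cuDNN      :", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU        :", torch.cuda.get_device_name(0))
```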
Hi! Have you solved your problem?
Thanks a lot. I have solved the problem by following your kind advice.