
Not able to reproduce the results listed in the paper with my trained model #7

Closed
LigZhong opened this issue May 3, 2022 · 8 comments

Comments

@LigZhong

LigZhong commented May 3, 2022

I ran into mode collapse once the step count exceeded 300K, and with the final model I obtained I am not able to reproduce the results shown in the paper. Could you share your loss curves? @Paper99

@Paper99
Collaborator

Paper99 commented May 4, 2022

Due to offset overflow in the deformable convolution, the training process may collapse.
A similar issue also occurs in other works that use DCN.
Please refer to this issue for more details.

However, we never encountered this problem when training the final model (which takes 500K iterations).
A collapsed run can produce a degraded model whose results are far from those of our released model.
I suggest retraining, or resuming training from an earlier, safe checkpoint.

For convenience, we provide our loss curves as follows:

[image: training loss curves]

I hope this helps.
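
For anyone who wants to catch the overflow before the losses diverge, below is a minimal sketch (assuming torchvision's `DeformConv2d`; the `MonitoredDCN` name and `last_offset_mean` attribute are made up for illustration, not code from this repository) that logs the mean absolute offset predicted by the DCN:

```python
# Minimal sketch, not the repository's actual code: a deformable conv wrapper
# that exposes the magnitude of its predicted offsets for logging.
import torch.nn as nn
from torchvision.ops import DeformConv2d


class MonitoredDCN(nn.Module):
    """Deformable conv whose predicted offsets can be inspected for overflow."""

    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        # 2 * k * k offset channels: an (x, y) offset per sampling location
        self.offset_conv = nn.Conv2d(
            in_channels, 2 * kernel_size * kernel_size, kernel_size, padding=padding
        )
        self.dcn = DeformConv2d(in_channels, out_channels, kernel_size, padding=padding)
        self.last_offset_mean = 0.0  # exposed so the training loop can log it

    def forward(self, x):
        offset = self.offset_conv(x)
        # a sudden jump in this statistic usually precedes the loss collapse
        self.last_offset_mean = offset.detach().abs().mean().item()
        return self.dcn(x, offset)
```

Logging `last_offset_mean` every few hundred iterations makes it easy to see where the offsets start to blow up and to pick a checkpoint from before that point to resume from.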

@sydney0zq

sydney0zq commented Sep 1, 2022

> Due to offset overflow in the deformable convolution, the training process may collapse. A similar issue also occurs in other works that use DCN. [...] I suggest retraining, or resuming training from an earlier, safe checkpoint.

@Paper99 I also encountered this problem on my machine. Which GPU did you use for training? I use a V100 (32 GB), and training collapses at around 300K iterations.

@Paper99
Collaborator

Paper99 commented Sep 2, 2022

Hi, we used 8 V100 (16 GB) GPUs or 8 1080 Ti GPUs to train our model.

@sydney0zq

@Paper99 How do you suggest solving the collapse problem? If we clip the DCN module's weights, I am not sure what range to use ... Is there a replacement module that avoids the issue?
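
A workaround used in some other DCN-based restoration projects (a hedged sketch, not a fix confirmed by the maintainers here) is to bound the predicted offsets rather than clip the convolution weights; `BoundedOffsetDCN` and `max_offset` are hypothetical names, and 10 pixels is only a guessed bound:

```python
# Hedged sketch, not the maintainers' solution: keep the DCN sampling offsets
# inside (-max_offset, max_offset) so they cannot drift arbitrarily far.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class BoundedOffsetDCN(nn.Module):
    """Deformable conv whose offsets are bounded by a scaled tanh."""

    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1,
                 max_offset=10.0):
        super().__init__()
        self.max_offset = max_offset
        self.offset_conv = nn.Conv2d(
            in_channels, 2 * kernel_size * kernel_size, kernel_size, padding=padding
        )
        self.dcn = DeformConv2d(in_channels, out_channels, kernel_size, padding=padding)

    def forward(self, x):
        # tanh bounds the sampling offsets smoothly, so gradients still flow
        offset = self.max_offset * torch.tanh(self.offset_conv(x))
        return self.dcn(x, offset)
```

Bounding with `tanh` keeps the sampling locations finite while leaving the offsets differentiable, which is gentler than hard clipping.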

@sydney0zq

sydney0zq commented Sep 5, 2022

@Paper99 Hello, how did you select the final released checkpoint: the 500K-iteration checkpoint, or the best among several late checkpoints?

@Paper99
Collaborator

Paper99 commented Sep 6, 2022

We just chose the best one.
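
Concretely, "choose the best" can be done by scoring a few late checkpoints on a validation set and keeping the top scorer. In the sketch below, `build_model()`, `val_loader` and `psnr()` are placeholder names, not code from this repository:

```python
# Minimal sketch of checkpoint selection: evaluate each candidate checkpoint
# on a validation set and return the one with the highest PSNR.
import torch


def validate_psnr(model, val_loader, psnr):
    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for lq, gt in val_loader:
            total += float(psnr(model(lq), gt))
            count += 1
    return total / max(count, 1)


def pick_best(checkpoint_paths, build_model, val_loader, psnr):
    best_path, best_score = None, float("-inf")
    for path in checkpoint_paths:
        model = build_model()
        model.load_state_dict(torch.load(path, map_location="cpu"))
        score = validate_psnr(model, val_loader, psnr)
        if score > best_score:
            best_path, best_score = path, score
    return best_path, best_score
```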

@MasterHow

Same question.

@jiahui1688

> @Paper99 How do you suggest solving the collapse problem? If we clip the DCN module's weights, I am not sure what range to use ... Is there a replacement module that avoids the issue?

Hi, has this problem been solved? I have the same problem. Thank you.
