
I did not reproduce the scores in the paper, what is your environment when training? #5

Closed
daidaiershidi opened this issue Jun 30, 2022 · 2 comments


@daidaiershidi

Thank you for proposing such an interesting work. On Charades, the original setup uses 4 GPUs with a batch size of 48, so I set the batch size to 24 on two RTX 3090s to keep the number of samples per GPU the same. All other configurations remain unchanged. However, the scores I get are:

R@1, IoU@0.5 = 45.35 (47.31 in paper)
R@1, IoU@0.7 = 26.30 (27.28 in paper)
R@5, IoU@0.5 = 84.21 (83.74 in paper)
R@5, IoU@0.7 = 57.02 (58.41 in paper)

This gap confuses me. What was your training environment? And if I don't have 4 GPUs, is there any way to reach the scores in the paper? Looking forward to your reply.

@zhenzhiwang
Collaborator

zhenzhiwang commented Jun 30, 2022

Hi, thank you for your interest in our work. I think the number of iterations in your configuration is twice that of our original configuration, so I believe the solution is either to: 1) halve the total number of epochs, as well as the number of epochs for freezing BERT, removing the contrastive loss, etc.; or 2) accumulate gradients and update the optimizer every two steps, normalizing the losses accordingly (e.g., multiplying by 1/2). Note that Charades is the smallest dataset in this task, so a little performance fluctuation is common. I believe a performance gap of less than 0.5 indicates a good reproduction. For further questions, please feel free to comment here.
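The second suggestion can be checked numerically: with a 1/2 normalization term on each step's loss, accumulating gradients over two mini-batches of 24 reproduces the gradient of a single batch of 48. Below is a minimal sketch using a hypothetical 1-D linear model with an MSE loss (the model, data, and `grad` helper are illustrative placeholders, not the repository's code):

```python
import random

random.seed(0)

# Toy 1-D linear model y = w * x with MSE loss;
# the gradient dL/dw over a batch is mean(2 * (w*x - y) * x).
def grad(w, batch):
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.5
data = [(random.random(), random.random()) for _ in range(48)]

# Gradient of one full batch of 48 (the original 4-GPU configuration).
g_full = grad(w, data)

# Gradient accumulation over two steps of 24 samples each,
# with each step's loss normalized by 1/2 as suggested.
g_accum = 0.5 * grad(w, data[:24]) + 0.5 * grad(w, data[24:])

assert abs(g_full - g_accum) < 1e-12  # the two are identical
```

The 1/2 factor is what makes the two equivalent: without it, the accumulated gradient would be twice the full-batch one.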

@daidaiershidi
Author

I adopted your second suggestion ("accumulate gradients and update the optimizer every two steps with a normalization term on the losses, e.g., multiplying by 1/2"). In addition, gradient accumulation is often accompanied by a scaled learning rate; since the number of accumulation steps is 2, I set learning_rate = original_learning_rate * sqrt(2). With this I obtained similar results. Thank you for your help; it has taught me a lot.
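For reference, the scheme described in this thread (normalize each step's loss by 1/2, update the optimizer every two steps, scale the learning rate by sqrt(2)) can be sketched as follows. The base learning rate and the per-step gradients are hypothetical placeholder values, not the repository's actual configuration:

```python
import math

base_lr = 1e-4        # hypothetical original learning rate
accum_steps = 2       # mini-batches accumulated per optimizer update

# sqrt scaling rule applied when accumulating over 2 steps
lr = base_lr * math.sqrt(accum_steps)

w = 0.5
grads = [0.1, 0.3, -0.2, 0.4]  # toy per-mini-batch gradients

accumulated = 0.0
for step, g in enumerate(grads, start=1):
    accumulated += g / accum_steps   # 1/2 normalization of each step
    if step % accum_steps == 0:      # update the optimizer every 2 steps
        w -= lr * accumulated
        accumulated = 0.0
```

In a real PyTorch loop the same structure applies: scale each loss by `1 / accum_steps` before `backward()`, and call `optimizer.step()` and `optimizer.zero_grad()` only every `accum_steps` iterations.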
