
How many gpu days does the training procedure of AVQA take? #8

Closed
Rainlt opened this issue Mar 23, 2023 · 4 comments

Rainlt commented Mar 23, 2023

Hello, I'm interested in your nice work and am trying to reproduce the results, but I found that training for the AVQA task takes nearly 5 GPU days on a single 3090. Is this normal?
These are the times recorded during training:

feature Embed time:  0.0016405582427978516
time for posi encode:  0.26480627059936523
time for nega encode:  0.09544777870178223
time for grounding:  0.009487152099609375
time for result:  0.0050661563873291016

As the log shows, encoding one audio clip and its positive visual sample with the Swin Transformer plus adapters takes about 0.26 s, so it would take roughly 2 GPU days just to encode the positive features over 30 epochs.
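(Side note for anyone re-measuring these numbers: CUDA kernels run asynchronously, so wall-clock timings around a forward pass can be misleading without a device sync. Below is a minimal sketch of how a per-step timing could be taken, assuming PyTorch; `model` and `batch` are placeholders, not names from this repo.)

```python
import time
import torch

def timed_forward(model, batch, device="cuda"):
    """Time one forward pass, syncing so async CUDA kernels are fully counted."""
    model.eval()
    batch = batch.to(device)
    torch.cuda.synchronize()   # flush queued kernels before starting the clock
    start = time.time()
    with torch.no_grad():
        out = model(batch)
    torch.cuda.synchronize()   # wait until the forward pass really finishes
    return out, time.time() - start
```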

GenjiB (Owner) commented Mar 24, 2023

Actually, I got the best results at epoch 13. It took about 12 days with one A5000. You can also try not using positive and negative sampling (we did not study the effectiveness of the sampling, but I believe the results would be similar).

[attached image: training results]

GenjiB closed this as completed Mar 24, 2023
Rainlt (Author) commented Mar 24, 2023

You can also try not using positive and negative sampling (we did not study the effectiveness of the sampling, but I believe the results would be similar)

Thanks for your kind response. I can see your result reaches 77 at the end of epoch 13, but I can only reach 76.0 there. I have some ideas about the gap. Could you please check a few things for me:

  1. Did you change the random seed? (the default is 1)
  2. Did you modify the audio waveforms? Some of them are shorter than 60 seconds, so I padded them to 60 s as the original paper describes (see the sketch after this list).
  3. There are some bugs in the image and feature dimensions. I fixed them locally, but since I also added a lot of notes to the code and I'm not proficient with git, I didn't open a pull request. E.g., swin_v2 needs inputs of size [192, 192], so the Resize call on line 86 of dataloader_avst.py should be changed from [224, 224] to [192, 192]. Also, f_v should probably be assigned to visual_posi on line 375 of net_avst.py.
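For item 2, this is roughly what I mean by padding; a minimal sketch assuming the waveform is a 1-D numpy array with a known sample rate (the actual dataloader may pad or tile differently):

```python
import numpy as np

def pad_audio(wave: np.ndarray, sample_rate: int = 16000, target_sec: int = 60) -> np.ndarray:
    """Zero-pad (or trim) a 1-D waveform to exactly target_sec seconds."""
    target_len = sample_rate * target_sec
    if len(wave) < target_len:
        # zero-pad at the end; tiling/repeating the clip is another common choice
        wave = np.pad(wave, (0, target_len - len(wave)))
    return wave[:target_len]
```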

Rainlt (Author) commented Mar 24, 2023

Actually, I got the best results at epoch 13. It took about 12 days with one A5000. You can also try not using positive and negative sampling (we did not study the effectiveness of the sampling, but I believe the results would be similar).

Oh, one more thing: did you pretrain the grounding module as in the original code? I haven't done that. Since the backbone has been changed, I thought the pretrained parameters might be useless, so I commented out the loading code on line 227 of main_avst.py. I think this may be the main cause of the difference!
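For context, what I commented out is just the checkpoint loading; if the backbone changed, one could instead load only the keys that still match. A minimal sketch assuming a plain PyTorch state_dict checkpoint (the "grounding" prefix here is a guess, not the repo's actual naming):

```python
import torch

def load_matching_weights(model, ckpt_path, prefix="grounding"):
    """Load only checkpoint entries whose names and shapes match the current model."""
    state = torch.load(ckpt_path, map_location="cpu")
    model_state = model.state_dict()
    kept = {k: v for k, v in state.items()
            if k.startswith(prefix) and k in model_state and v.shape == model_state[k].shape}
    # strict=False leaves everything else (e.g. the new backbone) untouched
    model.load_state_dict(kept, strict=False)
    return sorted(kept)
```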

GenjiB (Owner) commented Mar 25, 2023

@Rainlt Thanks for pointing that out. I found we did use the pretrained grounding module. Gonna fix this bug soon.
