
regarding hierarchical representation pattern #2

Closed
seyeeet opened this issue Nov 5, 2021 · 4 comments

seyeeet commented Nov 5, 2021

In the paper it is mentioned that:

To demonstrate its effectiveness, we conduct a non-hierarchical version by setting the size of the local window in all attention layers to 512, which is the largest window size in the last encoder/decoder block (out of memory if we directly set the size of the local window in all attention layers to the video length).
I was not able to figure out where in the code the attention layers have a window of 512. Can you please point me in the right direction?

ChinaYi (Owner) commented Nov 6, 2021

In libs/models/tcn.py (L472-L473):

self.layers = nn.ModuleList(
    [AttModule(2 ** i, num_f_maps, num_f_maps, r1, r2, att_type, 'encoder', alpha)  # window size = 2**i
     for i in range(num_layers)])

where num_layers=10, which means the last attention module has a window of 2**9 = 512. If you want to reproduce the ablation study, just replace 2**i with 512, so that every layer has a window of 512.
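For reference, a minimal sketch (not code from this repository) contrasting the two window schedules described above; under this reading, the only change needed in the snippet is swapping 2 ** i for a constant 512:

    num_layers = 10

    # Hierarchical pattern used by default: the window doubles with depth,
    # giving 1, 2, 4, ..., 512 across the 10 layers.
    hierarchical_windows = [2 ** i for i in range(num_layers)]
    assert hierarchical_windows[-1] == 512

    # Non-hierarchical ablation from the paper: every layer uses the largest window.
    non_hierarchical_windows = [512 for _ in range(num_layers)]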

ChinaYi closed this as completed Nov 6, 2021
seyeeet (Author) commented Nov 6, 2021

@ChinaYi Thank you for your answer.

  • Should a window of 512 work better or worse than 2**i, in your opinion?

  • Also, one more thing: should I use the sliding-attention option to achieve the best results in the paper?

  • Finally, can you please let me know the parameters used to achieve the best performance for the encoder and decoder, i.e., the parameters that lead to the best results in the paper? I notice the performance drops slightly when I go with the current default settings.

ChinaYi (Owner) commented Nov 7, 2021

  • A window of 512 is worse than 2**i in our experiments.
  • Yes, the sliding-window approach is slightly better than the block-wise approach.
  • The current setting is what I used. The performance drop is possibly due to the unstable training process of ASRF caused by the boundary prediction. I strongly recommend picking the best model according to the validation set instead of directly using the model from epoch 80 (a generic sketch of this checkpoint selection follows below). By the way, if you want to avoid the tedious parameter search, I recommend the pure ASFormer at https://github.com/ChinaYi/ASFormer , where the training process is very stable and not sensitive to the training epochs.
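A generic sketch of that checkpoint-selection step, assuming checkpoints are saved per epoch; the glob pattern and the evaluate_on_val helper are hypothetical stand-ins, not this repository's actual layout:

    import glob
    import torch

    def select_best_checkpoint(model, evaluate_on_val, ckpt_glob='checkpoints/epoch-*.model'):
        """Load each saved checkpoint, score it on the validation split, and return
        the best one instead of blindly taking the epoch-80 weights.
        evaluate_on_val(model) is any callable returning a metric where higher is better."""
        best_score, best_path = float('-inf'), None
        for path in sorted(glob.glob(ckpt_glob)):
            model.load_state_dict(torch.load(path, map_location='cpu'))
            score = evaluate_on_val(model)
            if score > best_score:
                best_score, best_path = score, path
        return best_path, best_score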

seyeeet (Author) commented Nov 7, 2021

Thanks for the hints, I appreciate it!
