Question about the motivation of BE #1
Hi Licai, Thanks for your interest in our work. Spatial cropping is meant to keep the main actor region similar for the anchor and positive pair. The triplet/contrastive loss optimizes two terms: one over [a, p] and the other over [a, n]. Since BE introduces a distractor into the positive, the model cannot pull them close from spatial low-level cues alone. In addition, negative pairs (both intra-video and inter-video) are also included; note that the intra-video negative shares a more similar appearance with the anchor than the positive does. Yours,
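The two terms described above can be sketched with a minimal margin-based triplet loss. This is only an illustration of the [a, p] / [a, n] trade-off, not the actual BE training code; the feature vectors below are toy values chosen by hand.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Margin-based triplet loss on cosine similarity:
    push sim(a, p) above sim(a, n) by at least `margin`."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

# Toy 4-d features (hypothetical, for illustration only): the BE-mixed
# positive stays close to the anchor in feature space, while the
# intra-video negative points in a different direction.
a = np.array([1.0, 0.0, 0.0, 0.0])
p = np.array([0.9, 0.1, 0.0, 0.0])   # positive: similar to anchor
n = np.array([0.1, 0.9, 0.0, 0.0])   # intra-video negative

loss = triplet_loss(a, p, n)  # 0.0 here: the margin is already satisfied
```

Swapping the roles of `p` and `n` makes the loss positive, which is what forces the encoder to rely on cues (such as motion) that separate them.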
Hi Awiny, Thanks for your reply! I overlooked the role of the hard intra-video negative introduced in your paper. With the hard intra-video negative (which has a more different motion pattern than the positive), the model is forced to focus on motion information in order to attract the anchor and the positive and repel the anchor and the negative. However, since the anchor and the positive come from the same clip while the intra-video negative is randomly sampled from the same video, I believe the intra-video negative shares a more dissimilar appearance with the anchor than the positive does, even though the positive is mixed with a static frame from itself. By the way, did you ablate the hard intra-video negative (i.e., train BE without it)? There seems to be no such experiment in the paper. Best,
Hi Licai, Sorry for the late reply.
Yours,
Hi jinpeng,
I noticed that the "pt_spatial_size" (i.e., cropping size) for the anchor and the positive is large (112/128 or 224/256) in the training transform code, which means the two crops will overlap greatly. The motivation of BE is to force the model to focus on dynamic motion information by turning the positive into a distracted version of the anchor, but such a large overlap will naturally pull their representations close on its own: the model can easily match the anchor and the positive by attending only to the shared appearance/background in the overlapped area, without ever capturing the dynamic motion information. What do you think of this?
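To make the overlap concern concrete, here is a small sketch of the worst-case overlap (IoU) between two random square crops of the quoted sizes. `min_overlap_fraction` is a hypothetical helper written for this comment, not a function from the BE repository.

```python
def min_overlap_fraction(crop, full):
    """Worst-case IoU between two random `crop` x `crop` windows taken
    from a `full` x `full` frame. Each window's offset can differ by at
    most (full - crop) pixels per axis, so the smallest possible
    intersection is a (crop - shift)^2 square (or empty)."""
    shift = full - crop
    inter = max(0, crop - shift) ** 2      # worst-case intersection area
    union = 2 * crop ** 2 - inter          # union of the two windows
    return inter / union

# With the sizes quoted above, even the worst-case pair of crops still
# overlaps substantially (IoU about 0.58 for both 112/128 and 224/256),
# so most of the appearance/background content is shared.
worst = min_overlap_fraction(112, 128)
```

Because 112/128 and 224/256 are the same ratio, both settings give the same worst-case overlap; a much smaller crop (e.g., 64 from 128) would allow fully disjoint crops.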
Thanks!
Best,
Licai