
Question about the motivation of BE #1

Closed
youcaiSUN opened this issue Mar 24, 2021 · 3 comments

Comments

@youcaiSUN

Hi jinpeng,

I noticed that the `pt_spatial_size` (i.e., cropping size) for the anchor and the positive is large (112/128 or 224/256) in the train-transform code, which means they will overlap greatly after cropping. Since the motivation of BE is to force the model to focus on dynamic motion information by creating a distracted positive for the anchor, large overlap seems to work against this: the model could easily pull the anchor and the positive together by focusing only on the appearance/background information shared in the overlapped area, without needing to capture the dynamic motion. What do you think of this?
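For context, the distracted positive discussed in this thread is built by blending a static frame from the clip into every frame. A minimal NumPy sketch (the function name, the mixing weight `lam`, and the uniform blending are my assumptions for illustration; the repo's actual augmentation may differ):

```python
import numpy as np

def distract_positive(clip, lam=0.5, rng=None):
    """Blend one randomly chosen static frame of `clip` into every frame.

    clip: float array of shape (T, H, W, C). `lam` is a hypothetical
    mixing weight; the blending ratio used by BE may differ.
    """
    rng = np.random.default_rng() if rng is None else rng
    static = clip[rng.integers(len(clip))]       # one frame, frozen in time
    return (1.0 - lam) * clip + lam * static     # same shape as the input

# Toy clip: 8 frames of 4x4 RGB noise
clip = np.random.rand(8, 4, 4, 3)
pos = distract_positive(clip)
assert pos.shape == clip.shape
```

The static frame adds appearance/background content that carries no motion, which is what makes the positive "distracting".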

Thanks!

Best,
Licai

@FingerRec
Owner

Hi Licai,

Thanks for your interest in our work. Spatial cropping is meant to keep the main actor region similar for both the anchor and the positive. The triplet/contrastive loss optimizes in two directions: one is the [a, p] term and the other is the [a, n] term. Since BE introduces a distractor into the positive, it is hard to pull the pair close from spatial low-level cues alone. In addition, negative pairs (intra-video and inter-video) are also included; notice that the intra-video negative shares more similar appearance with the anchor than the positive does.
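The two loss terms mentioned above can be sketched as a plain triplet margin loss (a minimal NumPy sketch; the function name and margin value are illustrative, not the repo's actual implementation):

```python
import numpy as np

def triplet_loss(a, p, n, margin=0.5):
    """Triplet margin loss over the two terms from the reply:
    pull the [a, p] distance down, push the [a, n] distance up."""
    d_ap = np.linalg.norm(a - p)   # anchor vs. distracted positive
    d_an = np.linalg.norm(a - n)   # anchor vs. (intra- or inter-video) negative
    return max(0.0, d_ap - d_an + margin)

# If the intra-video negative shares appearance with the anchor, an
# appearance-only embedding gives d_an close to d_ap, the hinge stays
# positive, and the model is pushed toward motion cues instead.
```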

Yours,
Awiny

@youcaiSUN
Author

Hi Awiny,

Thanks for your reply! I overlooked the role of the hard intra-video negative introduced in your paper. With the hard intra-video negative (whose motion pattern differs more from the anchor's than the positive's does), the model can be forced to focus on motion information to attract the anchor and the positive and repel the anchor and the negative. However, since the anchor and the positive come from the same clip while the intra-video negative is randomly sampled from the same video, I believe the intra-video negative's appearance is more dissimilar to the anchor than the positive's is, even though the positive is mixed with a static frame from itself.

By the way, did you ablate the hard intra-video negative (i.e., train BE without it)? There seems to be no such experiment in the paper.

Best,
Licai

@FingerRec
Owner

FingerRec commented Mar 27, 2021

Hi Licai,

Sorry for the late reply.

  1. Most of the videos generated by BE are a mess (especially with camera motion), so it is hard to say the anchor is similar to the positive. I provide some examples without augmentation but with less camera motion in the figures; you can also generate them with triplet_visualization.py.
  2. The sampling strategy for the intra-video negative depends on the dataset. E.g., on UCF101, negative_index > anchor_index + 2, which means part of their frames may be the same.
  3. The hard intra-video negative should lead to around a 3% improvement on HMDB51 with K400 pretraining.
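The sampling rule in point 2 can be sketched as follows (the function name and the fallback for anchors near the end of the video are my own assumptions, not the repo's code):

```python
import random

def sample_negative_index(anchor_index, num_clips, gap=2, rng=random):
    """Sample an intra-video negative start index strictly after the anchor.

    Paraphrases the UCF101 rule above (negative_index > anchor_index + 2);
    how anchors near the end of the video are handled is an assumption.
    """
    lo = min(anchor_index + gap + 1, num_clips - 1)
    return rng.randint(lo, num_clips - 1)   # inclusive on both ends

# The sampled clip starts later in the same video, so it shares appearance
# with the anchor but usually shows a different motion pattern -- and when
# the gap is small, the two clips may still share some frames.
```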

Yours,
Awiny
