Question about the motivation of BE #1
Hi Licai, Thanks for your interest in our work. Spatial cropping is meant to keep the main actor region similar for the anchor and positive pair. The triplet/contrastive loss optimizes two terms: one over [a, p] and the other over [a, n]. Since BE introduces a distractor into the positive, the model cannot pull them close from spatial low-level cues alone. In addition, negative pairs (both intra-video and inter-video) are also included; note that the intra-video negative shares a more similar appearance with the anchor than the positive does. Yours,
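The two terms described above can be sketched with a minimal margin-based triplet loss. This is only an illustration of the [a, p] / [a, n] trade-off, not the actual BE training code; the feature vectors below are toy values chosen by hand.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Margin-based triplet loss on cosine similarity:
    push sim(a, p) above sim(a, n) by at least `margin`."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))

# Toy 4-d features (hypothetical, for illustration only): the BE-mixed
# positive stays close to the anchor in feature space, while the
# intra-video negative points in a different direction.
a = np.array([1.0, 0.0, 0.0, 0.0])
p = np.array([0.9, 0.1, 0.0, 0.0])   # positive: similar to anchor
n = np.array([0.1, 0.9, 0.0, 0.0])   # intra-video negative

loss = triplet_loss(a, p, n)  # 0.0 here: the margin is already satisfied
```

Swapping the roles of `p` and `n` makes the loss positive, which is what forces the encoder to rely on cues (such as motion) that separate them.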
Hi Awiny, Thanks for your reply! I overlooked the role of the hard intra-video negative introduced in your paper. With the hard intra-video negative (which has a more different motion pattern than the positive), the model is forced to focus on motion information in order to attract the anchor and the positive and repel the anchor and the negative. However, since the anchor and the positive come from the same clip while the intra-video negative is randomly sampled from the same video, I believe the intra-video negative shares a more dissimilar appearance with the anchor than the positive does, even though the positive is mixed with a static frame from itself. By the way, did you ablate the hard intra-video negative (i.e., train BE without it)? There seems to be no such experiment in the paper. Best,
Hi Licai, Sorry for the late reply.
Yours,
Hi jinpeng,
I noticed that the "pt_spatial_size" (i.e., cropping size) for the anchor and the positive is large (112/128 or 224/256) in the training transform code, which means the two crops will overlap greatly. The motivation of BE is to force the model to focus on dynamic motion information by turning the positive into a distracted version of the anchor, but such a large overlap will naturally pull their representations close on its own: the model can easily match the anchor and the positive by attending only to the shared appearance/background in the overlapped area, without ever capturing the dynamic motion information. What do you think of this?
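To make the overlap concern concrete, here is a small sketch of the worst-case overlap (IoU) between two random square crops of the quoted sizes. `min_overlap_fraction` is a hypothetical helper written for this comment, not a function from the BE repository.

```python
def min_overlap_fraction(crop, full):
    """Worst-case IoU between two random `crop` x `crop` windows taken
    from a `full` x `full` frame. Each window's offset can differ by at
    most (full - crop) pixels per axis, so the smallest possible
    intersection is a (crop - shift)^2 square (or empty)."""
    shift = full - crop
    inter = max(0, crop - shift) ** 2      # worst-case intersection area
    union = 2 * crop ** 2 - inter          # union of the two windows
    return inter / union

# With the sizes quoted above, even the worst-case pair of crops still
# overlaps substantially (IoU about 0.58 for both 112/128 and 224/256),
# so most of the appearance/background content is shared.
worst = min_overlap_fraction(112, 128)
```

Because 112/128 and 224/256 are the same ratio, both settings give the same worst-case overlap; a much smaller crop (e.g., 64 from 128) would allow fully disjoint crops.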
Thanks!
Best,
Licai