
feature extraction (i3d and optical flow) #7

Closed
Lvqin001 opened this issue May 16, 2022 · 16 comments

Comments

@Lvqin001

Hello, I would like to ask which codebase you used for the I3D feature extraction and the optical flow feature extraction mentioned in the dataset paper? I want to reproduce it and then test it on my own videos.

@yeliudev
Member

yeliudev commented May 17, 2022

We used this codebase to extract I3D features for YouTube Highlights and TVSum. The optical flow features of Charades-STA are pre-extracted and officially released in the Charades dataset.

@G-Apple1

We used this codebase to extract I3D features for YouTube Highlights and TVSum. The optical flow features are only used in Charades-STA, and they are pre-extracted and officially released in the Charades dataset.

For video feature extraction, do you use denseflow to extract the RGB frames and optical flow first?

@yeliudev
Member

@G-Apple1 No, we didn't. This codebase can extract visual features directly from raw videos.

@G-Apple1

I would like to know more details about the feature extraction:
How does a .wav audio file become a 2048-d feature?
How does an .mp4 video file become two 1024-d features?
I think you should add this to the README. It would be useful for applying your project to our own datasets.

@yeliudev
Member

Thanks for your suggestions. We may release a demo script in the future. For now, if you would like to run the model on your own datasets, we strongly recommend extracting the features again on the public datasets (e.g., QVHighlights, Charades-STA) using your own feature extractor and re-training our model on them (only a single GPU and a few hours are needed), since we also obtained the features from the dataset authors and do not know the exact feature extraction pipeline ourselves.

@G-Apple1

Thanks for your suggestions. We may release a demo script in the future. For now, if you would like to run the model on your own datasets, we strongly recommend extracting the features again on the public datasets (e.g., QVHighlights, Charades-STA) using your own feature extractor and re-training our model on them (only a single GPU and a few hours are needed), since we also obtained the features from the dataset authors and do not know the exact feature extraction pipeline ourselves.

OK, thank you for your prompt reply.

@Lvqin001
Author

We used this codebase to extract I3D features for YouTube Highlights and TVSum. The optical flow features are only used in Charades-STA, and they are pre-extracted and officially released in the Charades dataset.

In your paper, you say that one feature is computed for every 32 consecutive frames. Should I first extract features over the whole video (n frames in total) and then average the features of the frames that fall inside a clip (e.g., frames 0-100)? Or should I compute one feature for every 32 consecutive frames taken only from the clip (frames 0-100)?

@yeliudev
Member

yeliudev commented May 19, 2022

@Lvqin001 What do you mean by segments and clips? In our case, each feature vector contains the features of 32 consecutive frames; for example, a 160-frame video can be represented by 5 feature vectors. However, each clip in the annotations may not be temporally aligned with a feature vector: a clip is 2 s long and may not contain exactly 32 frames, depending on the video's frame rate. So a feature vector can be regarded as belonging to a clip only when their temporal overlap is more than 50%.
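
For concreteness, here is a minimal sketch of the 50%-overlap rule described above (not the actual code used by the authors; interpreting "50%" as 50% of a feature vector's own temporal span is an assumption):

```python
import numpy as np

def assign_features_to_clips(num_features, fps, frames_per_feat=32, clip_sec=2.0):
    """Assign each 32-frame feature vector to the 2-second clip with which it
    overlaps by more than 50% of the feature's own temporal span (assumption)."""
    feat_sec = frames_per_feat / fps               # duration of one feature vector (seconds)
    assignments = []
    for i in range(num_features):
        start, end = i * feat_sec, (i + 1) * feat_sec
        clip_idx = None
        for c in range(int(np.ceil(end / clip_sec))):
            c_start, c_end = c * clip_sec, (c + 1) * clip_sec
            overlap = max(0.0, min(end, c_end) - max(start, c_start))
            if overlap > 0.5 * feat_sec:           # the "> 50%" rule from the comment above
                clip_idx = c
                break
        assignments.append(clip_idx)
    return assignments

# Example: a 25 fps video with 10 feature vectors (320 frames, 12.8 s)
print(assign_features_to_clips(num_features=10, fps=25))
```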

@Lvqin001
Author

@Lvqin001 What do you mean by segments and clips? In our case, each feature vector contains the features of 32 consecutive frames; for example, a 160-frame video can be represented by 5 feature vectors. However, each clip in the annotations may not be temporally aligned with a feature vector: a clip is 2 s long and may not contain exactly 32 frames, depending on the video's frame rate. So a feature vector can be regarded as belonging to a clip only when their temporal overlap is more than 50%.

For example, a clip may contain 3*32 frames, so it contains three feature vectors. Are these three feature vectors averaged?

@yeliudev
Member

You are right. All the feature vectors belonging to the same clip are averaged.
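
A tiny illustrative sketch of this averaging step (not the actual pipeline; the `assignments` list is the hypothetical output of an overlap-assignment step like the one sketched earlier in this thread):

```python
import numpy as np

def average_per_clip(features, assignments):
    """features: (N, D) array of per-chunk feature vectors.
    assignments: clip index (or None) for each feature vector.
    Returns a dict mapping clip index -> averaged D-dim feature."""
    grouped = {}
    for feat, clip_idx in zip(features, assignments):
        if clip_idx is not None:
            grouped.setdefault(clip_idx, []).append(feat)
    return {c: np.mean(v, axis=0) for c, v in grouped.items()}

# Example: 10 hypothetical 1024-d I3D features spread over 4 clips
features = np.random.randn(10, 1024)
assignments = [0, 0, 1, 1, 2, 2, 2, 3, 3, 3]
clip_features = average_per_clip(features, assignments)
print(sorted(clip_features), clip_features[0].shape)   # [0, 1, 2, 3] (1024,)
```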

@Lvqin001
Author

You are right. All the feature vectors belonging to the same clip are averaged.

OK, thank you for your guidance.

@G-Apple1

G-Apple1 commented Jun 8, 2022

@Lvqin001 What do you mean by segments and clips? In our case, each feature vector contains the features of 32 consecutive frames; for example, a 160-frame video can be represented by 5 feature vectors. However, each clip in the annotations may not be temporally aligned with a feature vector: a clip is 2 s long and may not contain exactly 32 frames, depending on the video's frame rate. So a feature vector can be regarded as belonging to a clip only when their temporal overlap is more than 50%.

So a feature vector can be regarded as belonging to a clip only when their temporal overlap is more than 50%.
What do you mean by this sentence?

@Lynneyyq

Lynneyyq commented Sep 28, 2022

We used this codebase to extract I3D features for YouTube Highlights and TVSum. The optical flow features are only used in Charades-STA, and they are pre-extracted and officially released in the Charades dataset.

You said "the optical flow features are only used in Charades-STA", but I3D extracts both optical flow and RGB features for YouTube Highlights and TVSum, and the code uses torch.cat(video, optic). What is the purpose of this concatenation? What does the optical flow contribute here?
Thank you very much!

@yeliudev
Member

yeliudev commented Sep 28, 2022

@Lynneyyq Sorry for the mistake and thanks for pointing it out. We've double-checked the code and data. Both YouTube Highlights and TVSum use optical flow features as well.
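
For anyone reading along: the torch.cat(video, optic) call mentioned above simply concatenates the RGB and optical-flow feature streams along the feature dimension. A minimal illustration (the 1024-d sizes and the 32-frame chunking are assumptions based on the figures discussed earlier in this thread, not confirmed values):

```python
import torch

# One 1024-d RGB feature and one 1024-d optical-flow feature per 32-frame chunk (assumed sizes)
num_chunks = 5
rgb_feats = torch.randn(num_chunks, 1024)    # I3D RGB stream
flow_feats = torch.randn(num_chunks, 1024)   # I3D optical-flow stream

# Concatenate along the feature dimension -> (num_chunks, 2048) video features
video_feats = torch.cat((rgb_feats, flow_feats), dim=-1)
print(video_feats.shape)  # torch.Size([5, 2048])
```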

@Lynneyyq

Dear author, thanks a lot for your guidance. I have a few more questions.
Figure 3 in the paper shows the architecture of the bottleneck transformer module. In this figure, red box 1 marks the bottleneck tokens {zi} that capture the compressed features from all modalities.
(1) What does "all modalities" mean here? The paper says "Here Nb is a number much smaller than the number of video clips Nv." So are the bottleneck tokens {zi} compressed features for each modality separately, or a concatenation of the two modalities' compressed features? In red box 3, is {zi}' the fused feature?

(2) The paper compresses the original features by limiting the number of bottleneck tokens. So which features are dropped and which are kept? I don't quite understand. Are there any rules?

(3) In feature expansion, the features are expanded back to each modality. What is the purpose of this? Where do the final fused features come from?

@yeliudev
Member

yeliudev commented Oct 1, 2022

@Lynneyyq

  1. 'All modalities' means the video and audio features. The bottleneck tokens (red box 1) are learnable vectors from the very beginning. They serve as query tokens in the attention modules, so the bottleneck tokens are updated by summing themselves with the compressed features from video and audio (red boxes 2 & 3); a rough sketch follows after this list.
  2. This process is end-to-end trainable, which means the model learns by itself which information to keep and which to drop.
  3. The bottleneck tokens are only a bridge between the two modalities; the compressed features should be added back to each stream. The final fused features can be the sum of the two streams' outputs.
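
To make point 1 more concrete, here is a rough, self-contained sketch of the bottleneck idea (a simplified stand-in, not the actual module from the paper; the layer sizes, the single compression/expansion step, and the use of standard multi-head attention are all assumptions): the learnable bottleneck tokens act as queries that attend to the video and audio features, and each stream then queries the compressed tokens and adds the result back.

```python
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    """Simplified two-stream fusion via a small set of learnable bottleneck tokens."""

    def __init__(self, dim=256, num_tokens=4, num_heads=4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim))  # bottleneck tokens {z_i}
        self.compress = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.expand_v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.expand_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video, audio):
        # video: (B, Nv, dim), audio: (B, Na, dim)
        B = video.size(0)
        z = self.tokens.unsqueeze(0).expand(B, -1, -1)            # (B, Nb, dim), Nb << Nv

        # Compression: bottleneck tokens query both modalities and sum the result with themselves
        both = torch.cat((video, audio), dim=1)
        z = z + self.compress(z, both, both)[0]                   # updated tokens {z_i}'

        # Expansion: each stream queries the compressed tokens and adds the result back
        video = video + self.expand_v(video, z, z)[0]
        audio = audio + self.expand_a(audio, z, z)[0]
        return video, audio                                       # fused two-stream output

# Example usage
fusion = BottleneckFusion()
v, a = fusion(torch.randn(2, 100, 256), torch.randn(2, 100, 256))
print(v.shape, a.shape)
```

Because the number of bottleneck tokens Nb is much smaller than Nv, all cross-modal information has to pass through this narrow bottleneck, which is what forces the compression.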
