
feature extraction (i3d and optical flow) #7

Closed
Lvqin001 opened this issue May 16, 2022 · 16 comments

Comments

@Lvqin001

Hello, I would like to ask which codebase you used for the I3D feature extraction and the optical flow feature extraction mentioned in the dataset paper? I want to reproduce it and then test it on my own videos.

@yeliudev
Member

yeliudev commented May 17, 2022

We used this codebase to extract I3D features for YouTube Highlights and TVSum. The optical flow features of Charades-STA are pre-extracted and officially released in the Charades dataset.

@G-Apple1

We used this codebase to extract I3D features for YouTube Highlights and TVSum. The optical flow features are only used in Charades-STA, and they are pre-extracted and officially released in the Charades dataset.

For video feature extraction, do you use denseflow to extract the RGB frames and optical flow first?

@yeliudev
Member

@G-Apple1 No, we didn't. This codebase can extract visual features directly from raw videos.

@G-Apple1

I would like to know more details about the feature extraction:
How does a .wav audio file become a 2048-d feature?
How does an .mp4 video file become two 1024-d features?
I think you should add this to the README. It would be useful for applying your project to our own datasets.

@yeliudev
Member

Thanks for your suggestions. We may release a demo script in the future. For now, if you would like to run the model on your own datasets, we strongly recommend extracting the features again on the public datasets (e.g., QVHighlights, Charades-STA) using your own feature extractor and re-training our model on them (only a single GPU and a few hours are needed), since we also obtained the features from the dataset authors and do not know the exact feature extraction pipeline ourselves.

@G-Apple1

Thanks for your suggestions. We may release a demo script in the future. For now, if you would like to run the model on your own datasets, we strongly recommend extracting the features again on the public datasets (e.g., QVHighlights, Charades-STA) using your own feature extractor and re-training our model on them (only a single GPU and a few hours are needed), since we also obtained the features from the dataset authors and do not know the exact feature extraction pipeline ourselves.

OK, thank you for your prompt reply.

@Lvqin001
Author

We used this codebase to extract I3D features for YouTube Highlights and TVSum. The optical flow features are only used in Charades-STA, and they are pre-extracted and officially released in the Charades dataset.

In your paper, you say that one feature is computed for every 32 consecutive frames. Should I first extract features over the whole video (n frames in total) and then average the features of the frames that fall inside a clip (e.g., frames 0-100)? Or should I compute one feature for every 32 consecutive frames taken only from the clip (frames 0-100)?

@yeliudev
Member

yeliudev commented May 19, 2022

@Lvqin001 What do you mean by segments and clips? In our case, each feature vector contains the features of 32 consecutive frames; for example, a 160-frame video can be represented by 5 feature vectors. However, each clip in the annotations may not be temporally aligned with a feature vector: a clip is 2 s long and may not contain exactly 32 frames, depending on the video's frame rate. So a feature vector can be regarded as belonging to a clip only when their temporal overlap is more than 50%.
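
For concreteness, here is a minimal sketch of the 50%-overlap rule described above (not the actual code used by the authors; interpreting "50%" as 50% of a feature vector's own temporal span is an assumption):

```python
import numpy as np

def assign_features_to_clips(num_features, fps, frames_per_feat=32, clip_sec=2.0):
    """Assign each 32-frame feature vector to the 2-second clip with which it
    overlaps by more than 50% of the feature's own temporal span (assumption)."""
    feat_sec = frames_per_feat / fps               # duration of one feature vector (seconds)
    assignments = []
    for i in range(num_features):
        start, end = i * feat_sec, (i + 1) * feat_sec
        clip_idx = None
        for c in range(int(np.ceil(end / clip_sec))):
            c_start, c_end = c * clip_sec, (c + 1) * clip_sec
            overlap = max(0.0, min(end, c_end) - max(start, c_start))
            if overlap > 0.5 * feat_sec:           # the "> 50%" rule from the comment above
                clip_idx = c
                break
        assignments.append(clip_idx)
    return assignments

# Example: a 25 fps video with 10 feature vectors (320 frames, 12.8 s)
print(assign_features_to_clips(num_features=10, fps=25))
```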

@Lvqin001
Author

@Lvqin001 What do you mean by segments and clips? In our case, each feature vector contains the features of 32 consecutive frames; for example, a 160-frame video can be represented by 5 feature vectors. However, each clip in the annotations may not be temporally aligned with a feature vector: a clip is 2 s long and may not contain exactly 32 frames, depending on the video's frame rate. So a feature vector can be regarded as belonging to a clip only when their temporal overlap is more than 50%.

For example, a clip may contain 3*32 frames, so it contains three feature vectors. Are these three feature vectors averaged?

@yeliudev
Member

You are right. All the feature vectors belonging to the same clip are averaged.
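
A tiny illustrative sketch of this averaging step (not the actual pipeline; the `assignments` list is the hypothetical output of an overlap-assignment step like the one sketched earlier in this thread):

```python
import numpy as np

def average_per_clip(features, assignments):
    """features: (N, D) array of per-chunk feature vectors.
    assignments: clip index (or None) for each feature vector.
    Returns a dict mapping clip index -> averaged D-dim feature."""
    grouped = {}
    for feat, clip_idx in zip(features, assignments):
        if clip_idx is not None:
            grouped.setdefault(clip_idx, []).append(feat)
    return {c: np.mean(v, axis=0) for c, v in grouped.items()}

# Example: 10 hypothetical 1024-d I3D features spread over 4 clips
features = np.random.randn(10, 1024)
assignments = [0, 0, 1, 1, 2, 2, 2, 3, 3, 3]
clip_features = average_per_clip(features, assignments)
print(sorted(clip_features), clip_features[0].shape)   # [0, 1, 2, 3] (1024,)
```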

@Lvqin001
Author

You are right. All the feature vectors belonging to the same clip are averaged.

OK, thank you for your guidance.

@G-Apple1

G-Apple1 commented Jun 8, 2022

@Lvqin001 What do you mean by segments and clips? In our case, each feature vector contains the features of 32 consecutive frames; for example, a 160-frame video can be represented by 5 feature vectors. However, each clip in the annotations may not be temporally aligned with a feature vector: a clip is 2 s long and may not contain exactly 32 frames, depending on the video's frame rate. So a feature vector can be regarded as belonging to a clip only when their temporal overlap is more than 50%.

So a feature vector can be regarded as belonging to a clip only when their temporal overlap is more than 50%.
What do you mean by this sentence?

@Lynneyyq

Lynneyyq commented Sep 28, 2022

We used this codebase to extract I3D features for YouTube Highlights and TVSum. The optical flow features are only used in Charades-STA, and they are pre-extracted and officially released in the Charades dataset.

You said "the optical flow features are only used in Charades-STA", but I3D extracts both optical flow and RGB features for YouTube Highlights and TVSum, and the code uses torch.cat(video, optic). What is the purpose of this concatenation? What does the optical flow contribute here?
Thank you very much!

@yeliudev
Member

yeliudev commented Sep 28, 2022

@Lynneyyq Sorry for the mistake and thanks for pointing it out. We've double-checked the code and data. Both YouTube Highlights and TVSum use optical flow features as well.
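
For anyone reading along: the torch.cat(video, optic) call mentioned above simply concatenates the RGB and optical-flow feature streams along the feature dimension. A minimal illustration (the 1024-d sizes and the 32-frame chunking are assumptions based on the figures discussed earlier in this thread, not confirmed values):

```python
import torch

# One 1024-d RGB feature and one 1024-d optical-flow feature per 32-frame chunk (assumed sizes)
num_chunks = 5
rgb_feats = torch.randn(num_chunks, 1024)    # I3D RGB stream
flow_feats = torch.randn(num_chunks, 1024)   # I3D optical-flow stream

# Concatenate along the feature dimension -> (num_chunks, 2048) video features
video_feats = torch.cat((rgb_feats, flow_feats), dim=-1)
print(video_feats.shape)  # torch.Size([5, 2048])
```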

@Lynneyyq

Dear author, thanks a lot for your guidance. I have a few more questions.
Figure 3 in the paper shows the architecture of the bottleneck transformer module. In this figure, red box 1 marks the bottleneck tokens {zi} that capture the compressed features from all modalities.
(1) What does "all modalities" mean here? The paper says "Here Nb is a number much smaller than the number of video clips Nv." So are the bottleneck tokens {zi} compressed features for each modality separately, or a concatenation of the two modalities' compressed features? In red box 3, is {zi}' the fused feature?

(2) The paper compresses the original features by limiting the number of bottleneck tokens. So which features are dropped and which are kept? I don't quite understand. Are there any rules?

(3) In feature expansion, the features are expanded back to each modality. What is the purpose of this? Where do the final fused features come from?

@yeliudev
Member

yeliudev commented Oct 1, 2022

@Lynneyyq

  1. 'All modalities' means the video and audio features. The bottleneck tokens (red box 1) are learnable vectors from the very beginning. They serve as query tokens in the attention modules, so the bottleneck tokens are updated by summing themselves with the compressed features from video and audio (red boxes 2 & 3); a rough sketch follows after this list.
  2. This process is end-to-end trainable, which means the model learns by itself which information to keep and which to drop.
  3. The bottleneck tokens are only a bridge between the two modalities; the compressed features should be added back to each stream. The final fused features can be the sum of the two streams' outputs.
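
To make point 1 more concrete, here is a rough, self-contained sketch of the bottleneck idea (a simplified stand-in, not the actual module from the paper; the layer sizes, the single compression/expansion step, and the use of standard multi-head attention are all assumptions): the learnable bottleneck tokens act as queries that attend to the video and audio features, and each stream then queries the compressed tokens and adds the result back.

```python
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    """Simplified two-stream fusion via a small set of learnable bottleneck tokens."""

    def __init__(self, dim=256, num_tokens=4, num_heads=4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim))  # bottleneck tokens {z_i}
        self.compress = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.expand_v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.expand_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video, audio):
        # video: (B, Nv, dim), audio: (B, Na, dim)
        B = video.size(0)
        z = self.tokens.unsqueeze(0).expand(B, -1, -1)            # (B, Nb, dim), Nb << Nv

        # Compression: bottleneck tokens query both modalities and sum the result with themselves
        both = torch.cat((video, audio), dim=1)
        z = z + self.compress(z, both, both)[0]                   # updated tokens {z_i}'

        # Expansion: each stream queries the compressed tokens and adds the result back
        video = video + self.expand_v(video, z, z)[0]
        audio = audio + self.expand_a(audio, z, z)[0]
        return video, audio                                       # fused two-stream output

# Example usage
fusion = BottleneckFusion()
v, a = fusion(torch.randn(2, 100, 256), torch.randn(2, 100, 256))
print(v.shape, a.shape)
```

Because the number of bottleneck tokens Nb is much smaller than Nv, all cross-modal information has to pass through this narrow bottleneck, which is what forces the compression.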
