How to align the audio and video at the clip level #18

Closed
Lynneyyq opened this issue Jun 9, 2022 · 8 comments

Comments

@Lynneyyq

Lynneyyq commented Jun 9, 2022

Your paper says "Visual and audio features are temporally aligned at clip level". For example, in the YouTube Highlights dataset, each video is divided into 100-frame clips with 50% overlap. I extracted the audio features with the codebase you provided.
How are the audio and video aligned at the clip level? How did you do it?

@Lynneyyq Lynneyyq changed the title How audio and video are aligned at the clip level How to align the audio and video at the clip level Jun 9, 2022
@Lynneyyq
Author

Lynneyyq commented Jun 9, 2022

[Two screenshots showing the printed feature dimensions of the video and audio features]
The two screenshots above show the feature dimensions of the video and the audio, respectively. How do I align the audio and video features marked in the red boxes?
Thank you very much for taking the time to answer.

@Mofafa
Collaborator

Mofafa commented Jun 10, 2022

Since time is fixed, the audio and video feature dimensions can be aligned according to time. You can get the fps of each original video and the frame indices of each video clip from the dataset annotations. Then you can convert the frame indices into time indices (e.g. fps = 25, clip frame index = [50, 100] -> clip time index = [2, 4]). Similarly, you can derive the corresponding feature indices from the time indices.
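A minimal sketch of this frame-index -> time-index -> feature-index conversion, for illustration only: the function name, arguments, and the `feats_per_second` value are assumptions, not part of the released codebase.

```python
# A minimal sketch of the frame-index -> time-index -> feature-index conversion
# described above. The function name, arguments, and `feats_per_second` value
# are illustrative assumptions, not part of the released codebase.

def clip_frame_to_feature_range(frame_start, frame_end, fps, feats_per_second):
    """Map a video clip's frame range to the matching audio feature range."""
    # Frame index -> time index in seconds (e.g. fps=25, [50, 100] -> [2.0, 4.0]).
    t_start, t_end = frame_start / fps, frame_end / fps
    # Time index -> audio feature index, given how many audio feature frames
    # the extractor produces per second.
    return int(round(t_start * feats_per_second)), int(round(t_end * feats_per_second))

# Example: fps = 25, clip frames [50, 100], and roughly 3.12 audio feature
# frames per second (e.g. 468 feature frames for a 150 s video).
print(clip_frame_to_feature_range(50, 100, fps=25, feats_per_second=468 / 150))
```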

@Lynneyyq
Author

Thank you for your reply!
At first I also considered that the time is fixed. However, the reference codebase you provided does not expose any related parameters.
The PANN codebase seems to extract features from the spectrogram of the sound; I did not see any parameters related to the clip time index.
If I extract features from the audio according to the time clips, PANN produces a feature map of shape n*2048.
Ideally, each time clip should yield a 1*2048 feature vector, which corresponds exactly to the video feature matrix. How did you do it?
Please forgive me for asking this question. Thank you very much!

@Lynneyyq
Author

Lynneyyq commented Jun 13, 2022

> Since time is fixed, the audio and video feature dimensions can be aligned according to time. You can get the fps of each original video and the frame indices of each video clip from the dataset annotations. Then you can convert the frame indices into time indices (e.g. fps = 25, clip frame index = [50, 100] -> clip time index = [2, 4]). Similarly, you can derive the corresponding feature indices from the time indices.

  1. The long audio is also divided into multiple clips to align with the video, e.g. audio clip time index = [0, 2], [1, 3], [2, 4], [3, 5]. For each audio clip, I first compute the spectrogram with STFT and then use convolutions to extract features, giving a feature matrix of shape n*2048. However, only a 1*2048 feature vector per audio clip can be aligned with the video. Is it reasonable to average or max-pool the n*2048 feature matrix of each audio clip to obtain 1*2048?

  2. How exactly did you align the audio and video?

@Mofafa
Collaborator

Mofafa commented Jun 13, 2022

> Thank you for your reply! At first I also considered that the time is fixed. However, the reference codebase you provided does not expose any related parameters. The PANN codebase seems to extract features from the spectrogram of the sound; I did not see any parameters related to the clip time index. If I extract features from the audio according to the time clips, PANN produces a feature map of shape n*2048. Ideally, each time clip should yield a 1*2048 feature vector, which corresponds exactly to the video feature matrix. How did you do it? Please forgive me for asking this question. Thank you very much!

All the feature vectors belonging to the same clip are averaged. In this case, we calculate the average over the first dimension (n).
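A minimal sketch of this averaging step, assuming the audio features of one clip form an n*2048 matrix; the value n = 7, the variable names, and the random data are illustrative only.

```python
import numpy as np

# A minimal sketch of the averaging step, assuming the audio features of one
# clip form an n*2048 matrix; n = 7 and the random data are illustrative only.
clip_audio_features = np.random.randn(7, 2048)

# Average over the first dimension (n) to get a single 1*2048 clip vector.
clip_vector = clip_audio_features.mean(axis=0)
print(clip_vector.shape)  # (2048,)
```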

@hpppppp8


> All the feature vectors belonging to the same clip are averaged. In this case, we calculate the average over the first dimension (n).

In the AudioTagging process:
A 150 s video goes [1, 4798625] -> [1, 1, 14996, 513] -> [1, 1, 14996, 64] -> [1, 64, 7498, 32] -> (some conv layers) -> and finally [1, 2048, 468].
Do you mean averaging the 468 feature frames according to 2-second clips?
How can I do that? Or is there something wrong with my process?

@hpppppp8

I also thought about the simple way of feeding each 2-second clip into the model separately, but that seems too time-consuming for extracting the audio features.

@yeliudev
Member

The [1, 2048, 468] matrix should be the sequence of audio features for the whole 150 s video. You may divide it into 2-second clips, each with shape [1, 2048, 6] or [1, 2048, 7], and average each into [1, 2048].
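A minimal sketch of this slicing-and-averaging, assuming 2-second clips with a 1-second stride (50% overlap, matching the [0, 2], [1, 3], ... time indices mentioned earlier in this thread); the variable names, random data, and rounding policy are assumptions for illustration.

```python
import numpy as np

# A minimal sketch of the slicing described above, assuming 2 s clips with a
# 1 s stride (50% overlap). Variable names, random data, and rounding are
# illustrative only.
audio_feats = np.random.randn(1, 2048, 468)   # features of the whole 150 s video
duration = 150.0                              # video length in seconds
feats_per_second = audio_feats.shape[-1] / duration  # ~3.12 feature frames / s

clip_vectors = []
for start in np.arange(0.0, duration - 2.0 + 1e-6, 1.0):  # 2 s clips, 1 s stride
    i0 = int(round(start * feats_per_second))
    i1 = int(round((start + 2.0) * feats_per_second))     # 6 or 7 frames per clip
    clip_vectors.append(audio_feats[:, :, i0:i1].mean(axis=-1))  # -> [1, 2048]

clip_feats = np.stack(clip_vectors, axis=1)   # -> [1, num_clips, 2048]
print(clip_feats.shape)
```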
