How to align the audio and video at the clip level #18

Closed
Lynneyyq opened this issue Jun 9, 2022 · 8 comments

Comments

@Lynneyyq

Lynneyyq commented Jun 9, 2022

Your paper says "Visual and audio features are temporally aligned at clip level". For example, in the YouTube Highlights dataset, each video is divided into 100-frame clips with 50% overlap. I extracted the audio features with the codebase you provided.
How are the audio and video aligned at the clip level? How did you do it?

@Lynneyyq Lynneyyq changed the title How audio and video are aligned at the clip level How to align the audio and video at the clip level Jun 9, 2022
@Lynneyyq
Author

Lynneyyq commented Jun 9, 2022

[Two screenshots showing the printed feature dimensions of the video and audio features]
The two screenshots above show the feature dimensions of the video and the audio, respectively. How do I align the audio and video features marked in the red boxes?
Thank you very much for taking the time to answer.

@Mofafa
Collaborator

Mofafa commented Jun 10, 2022

Since time is fixed, the audio and video feature dimensions can be aligned according to time. You can get the fps of each original video and the frame indices of each video clip from the dataset annotations. Then you can convert the frame indices into time indices (e.g. fps = 25, clip frame index = [50, 100] -> clip time index = [2, 4]). Similarly, you can derive the corresponding feature indices from the time indices.
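A minimal sketch of this frame-index -> time-index -> feature-index conversion, for illustration only: the function name, arguments, and the `feats_per_second` value are assumptions, not part of the released codebase.

```python
# A minimal sketch of the frame-index -> time-index -> feature-index conversion
# described above. The function name, arguments, and `feats_per_second` value
# are illustrative assumptions, not part of the released codebase.

def clip_frame_to_feature_range(frame_start, frame_end, fps, feats_per_second):
    """Map a video clip's frame range to the matching audio feature range."""
    # Frame index -> time index in seconds (e.g. fps=25, [50, 100] -> [2.0, 4.0]).
    t_start, t_end = frame_start / fps, frame_end / fps
    # Time index -> audio feature index, given how many audio feature frames
    # the extractor produces per second.
    return int(round(t_start * feats_per_second)), int(round(t_end * feats_per_second))

# Example: fps = 25, clip frames [50, 100], and roughly 3.12 audio feature
# frames per second (e.g. 468 feature frames for a 150 s video).
print(clip_frame_to_feature_range(50, 100, fps=25, feats_per_second=468 / 150))
```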

@Lynneyyq
Author

Thank you for your reply!
At first I also considered that the time is fixed. However, the reference codebase you provided does not expose any related parameters.
The PANN codebase seems to extract features from the spectrogram of the sound; I did not see any parameters related to the clip time index.
If I extract features from the audio according to the time clips, PANN produces a feature map of shape n*2048.
Ideally, each time clip should yield a 1*2048 feature vector, which corresponds exactly to the video feature matrix. How did you do it?
Please forgive me for asking this question. Thank you very much!

@Lynneyyq
Author

Lynneyyq commented Jun 13, 2022

> Since time is fixed, the audio and video feature dimensions can be aligned according to time. You can get the fps of each original video and the frame indices of each video clip from the dataset annotations. Then you can convert the frame indices into time indices (e.g. fps = 25, clip frame index = [50, 100] -> clip time index = [2, 4]). Similarly, you can derive the corresponding feature indices from the time indices.

  1. The long audio is also divided into multiple clips to align with the video, e.g. audio clip time index = [0, 2], [1, 3], [2, 4], [3, 5]. For each audio clip, I first compute the spectrogram with STFT and then use convolutions to extract features, giving a feature matrix of shape n*2048. However, only a 1*2048 feature vector per audio clip can be aligned with the video. Is it reasonable to average or max-pool the n*2048 feature matrix of each audio clip to obtain 1*2048?

  2. How exactly did you align the audio and video?

@Mofafa
Collaborator

Mofafa commented Jun 13, 2022

> Thank you for your reply! At first I also considered that the time is fixed. However, the reference codebase you provided does not expose any related parameters. The PANN codebase seems to extract features from the spectrogram of the sound; I did not see any parameters related to the clip time index. If I extract features from the audio according to the time clips, PANN produces a feature map of shape n*2048. Ideally, each time clip should yield a 1*2048 feature vector, which corresponds exactly to the video feature matrix. How did you do it? Please forgive me for asking this question. Thank you very much!

All the feature vectors belonging to the same clip are averaged. In this case, we calculate the average over the first dimension (n).
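A minimal sketch of this averaging step, assuming the audio features of one clip form an n*2048 matrix; the value n = 7, the variable names, and the random data are illustrative only.

```python
import numpy as np

# A minimal sketch of the averaging step, assuming the audio features of one
# clip form an n*2048 matrix; n = 7 and the random data are illustrative only.
clip_audio_features = np.random.randn(7, 2048)

# Average over the first dimension (n) to get a single 1*2048 clip vector.
clip_vector = clip_audio_features.mean(axis=0)
print(clip_vector.shape)  # (2048,)
```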

@hpppppp8


> All the feature vectors belonging to the same clip are averaged. In this case, we calculate the average over the first dimension (n).

In the AudioTagging process:
A 150 s video goes [1, 4798625] -> [1, 1, 14996, 513] -> [1, 1, 14996, 64] -> [1, 64, 7498, 32] -> (some conv layers) -> and finally [1, 2048, 468].
Do you mean averaging the 468 feature frames according to 2-second clips?
How can I do that? Or is there something wrong with my process?

@hpppppp8

I also thought about the simple way of feeding each 2-second clip into the model separately, but that seems too time-consuming for extracting the audio features.

@yeliudev
Member

The [1, 2048, 468] matrix should be the sequence of audio features for the whole 150 s video. You may divide it into 2-second clips, each with shape [1, 2048, 6] or [1, 2048, 7], and average each into [1, 2048].
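A minimal sketch of this slicing-and-averaging, assuming 2-second clips with a 1-second stride (50% overlap, matching the [0, 2], [1, 3], ... time indices mentioned earlier in this thread); the variable names, random data, and rounding policy are assumptions for illustration.

```python
import numpy as np

# A minimal sketch of the slicing described above, assuming 2 s clips with a
# 1 s stride (50% overlap). Variable names, random data, and rounding are
# illustrative only.
audio_feats = np.random.randn(1, 2048, 468)   # features of the whole 150 s video
duration = 150.0                              # video length in seconds
feats_per_second = audio_feats.shape[-1] / duration  # ~3.12 feature frames / s

clip_vectors = []
for start in np.arange(0.0, duration - 2.0 + 1e-6, 1.0):  # 2 s clips, 1 s stride
    i0 = int(round(start * feats_per_second))
    i1 = int(round((start + 2.0) * feats_per_second))     # 6 or 7 frames per clip
    clip_vectors.append(audio_feats[:, :, i0:i1].mean(axis=-1))  # -> [1, 2048]

clip_feats = np.stack(clip_vectors, axis=1)   # -> [1, num_clips, 2048]
print(clip_feats.shape)
```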
