How to align the audio and video at the clip level #18
Your paper says "Visual and audio features are temporally aligned at clip level". For example, in the YouTube Highlights dataset, each video is divided into clips of 100 frames with a 50% overlap, and I extracted the audio features with the codebase you provided. How should the audio and video features be aligned at the clip level? How did you do it?

Comments
Since both modalities share the same timeline, the audio and video features can be aligned along the time axis. You can get the fps of each original video and the frame indices of each video clip from the dataset annotation, then convert the frame indices into time indices (e.g. fps=25, clip frame index [50, 100] -> clip time index [2, 4]). Similarly, you can derive the corresponding feature indices from the time indices.
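A minimal sketch of that conversion, assuming the audio features come at one vector per second; the helper names, the `feats_per_sec` parameter, and the clip-span generator are illustrative, not from the codebase:

```python
# Illustrative helpers (not from the repo): map clip frame spans to rows of a
# one-feature-per-second audio feature sequence.

def clip_frame_spans(num_frames, clip_len=100, overlap=0.5):
    """Frame spans for fixed-length clips, e.g. 100 frames with 50% overlap."""
    step = int(clip_len * (1 - overlap))
    return [(s, s + clip_len) for s in range(0, num_frames - clip_len + 1, step)]

def frames_to_feature_indices(start_frame, end_frame, fps, feats_per_sec=1.0):
    """fps=25, frames [50, 100] -> time [2, 4] -> feature rows [2, 4)."""
    start = int(start_frame / fps * feats_per_sec)
    end = int(end_frame / fps * feats_per_sec)
    return start, end

print(clip_frame_spans(250))                       # [(0, 100), (50, 150), (100, 200), (150, 250)]
print(frames_to_feature_indices(50, 100, fps=25))  # (2, 4)
```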
Thank you for your reply!
All the feature vectors belonging to the same clip are averaged. In this case, we compute the average over the first dimension.
In the AudioTagging process, I also thought about the simple way of feeding each 2-second clip to the model, but extracting the audio features that way seems too time-consuming to me.
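Putting the two replies together, a hedged sketch of the pooling step: assuming framewise audio features for the whole video have been extracted once (shape [T, D], one row per second), each clip feature is the mean of the rows that fall inside the clip, so the model does not need to be re-run on every 2-second clip. The array names and shapes below are illustrative:

```python
import numpy as np

# Illustrative arrays: framewise audio features for one video, one row per
# second (T=120 seconds and D=2048 dimensions are made-up numbers).
audio_feats = np.random.randn(120, 2048)

# Feature-index spans per clip, e.g. from frames_to_feature_indices above.
clip_spans = [(0, 4), (2, 6), (4, 8)]

# Average all feature vectors belonging to the same clip over the first
# (temporal) dimension -> one pooled vector per clip.
clip_feats = np.stack([audio_feats[s:e].mean(axis=0) for s, e in clip_spans])
print(clip_feats.shape)  # (3, 2048)
```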