how to align the audio feature and video feature? #17
Comments
We use the default settings in PANN, with a sample rate of 32000 Hz. Since the duration is fixed, the audio and video features can be aligned by time. You can get the fps of each original video and the frame indices of each video clip from the dataset annotation, then convert the frame indices into time indices (e.g. fps=25, clip frame index = [50, 100] -> clip time index = [2, 4], in seconds). Similarly, you can derive the corresponding feature indices from the time indices. |
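The conversion described above could be sketched roughly as follows. This is a minimal illustration, not the authors' code: the helper names and the `features_per_second` value are hypothetical, and the real feature rate depends on how the audio features were extracted.

```python
import math

def clip_frames_to_seconds(start_frame, end_frame, fps):
    """Convert a clip's frame indices into start/end times in seconds."""
    return start_frame / fps, end_frame / fps

def seconds_to_feature_range(start_sec, end_sec, features_per_second, num_features):
    """Map a time span to a half-open range [start, end) of feature indices.

    features_per_second is an assumed, illustrative parameter; it is not
    stated in the thread.
    """
    start = int(math.floor(start_sec * features_per_second))
    end = int(math.ceil(end_sec * features_per_second))
    # Clamp to the available features and keep at least one index.
    start = min(max(start, 0), num_features - 1)
    end = min(max(end, start + 1), num_features)
    return start, end

# The example from the comment: fps=25, clip frame index = [50, 100].
start_sec, end_sec = clip_frames_to_seconds(50, 100, fps=25)
feat_range = seconds_to_feature_range(
    start_sec, end_sec, features_per_second=1.0, num_features=11
)
```

With one feature per second (an assumption), the clip spanning seconds 2 to 4 maps to feature rows 2 and 3.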
For the audio feature, the code is |
We first extract the audio feature from the whole video using the default settings in PANN. Then we compute the feature indices corresponding to each clip and average all features that fall within that clip. |
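A rough sketch of the averaging step, assuming the whole-video features are already available as a list of per-step vectors and the per-clip index ranges have been computed. The toy data, the 2-dimensional vectors, and the `mean_pool` helper are illustrative only; real features would be 2048-dimensional.

```python
def mean_pool(features, start, end):
    """Average the feature rows in [start, end) into a single vector."""
    rows = features[start:end]
    if not rows:
        raise ValueError("empty feature range")
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]

# Toy whole-video features: 6 time steps, 2 dimensions each (hypothetical).
features = [[float(t), float(10 * t)] for t in range(6)]

# Hypothetical half-open feature index ranges, one per video clip.
clip_ranges = [(0, 2), (2, 4), (4, 6)]

# One averaged audio vector per clip, so the audio and video features
# share the same first dimension.
clip_features = [mean_pool(features, s, e) for s, e in clip_ranges]
```

Overlapping clips work the same way: their index ranges simply overlap, and each clip still yields one averaged vector.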
Can you give a specific example? |
Please refer to #18. |
I read the answer in #18, and it seems somewhat inconsistent with the above. #18 says the features are time-aligned. My understanding was that the audio is first segmented by time and each segment is then fed into the network to extract features. Here, the entire audio is fed into the network first, and the features are split afterwards. If a 10 s audio is divided into 5 parts at 2-second intervals, the extracted features are [5, 2048]; but if the entire audio is fed into the network, the feature is [11, 2048]. How do these features correspond? |
Do you mean the audio feature for the whole 10s video is [11, 2048]? In this case, you may still make it temporally aligned with the video. Since this is a short video, using [2, 2048] for each video clip (resulting in 5 * [2, 2048] audio features), and dropping the last one should be fine. |
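The split suggested here could look something like the sketch below. It assumes 11 feature steps for a 10 s audio and 2 steps per clip, as in the example; the `split_into_clips` helper and the 4-dimensional toy vectors are hypothetical stand-ins for the real [11, 2048] features.

```python
def split_into_clips(features, steps_per_clip):
    """Split features into consecutive clips, dropping any leftover steps."""
    num_clips = len(features) // steps_per_clip
    return [
        features[i * steps_per_clip:(i + 1) * steps_per_clip]
        for i in range(num_clips)
    ]

# Toy stand-in for an [11, 2048] feature: 11 steps, 4 dimensions each.
features = [[float(t)] * 4 for t in range(11)]

# 11 steps at 2 steps per clip gives 5 clips; the 11th step is dropped.
clips = split_into_clips(features, steps_per_clip=2)
```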
If the size of the video feature is [14, 2048], I need to extract an audio feature of size [14, 2048].
Following your approach, I use the PANN_inference project to extract audio features from the raw wave file.
Because of the video clips and the overlap operation, the first dimension of the video feature is 14. How do I align the audio feature with the video feature?
I found that the size of the audio feature depends on the sample rate, window size, hop size, and so on. How should I set these parameters?
I would like to know more details about how to extract the audio feature. Thank you.