
How to align the audio feature and video feature? #17

Closed · Xuguozi opened this issue Jun 6, 2022 · 7 comments

Comments

@Xuguozi commented Jun 6, 2022

If the size of the video feature is [14, 2048], I need to extract an audio feature of the same size, [14, 2048].

Following your approach, I used the PANN_inference project to extract audio features from the raw wave file. Because of the video clipping and overlap operations, the first dimension of the video feature is 14. How do I align the audio feature with the video feature?

I found that the size of the audio feature depends on the sample rate, window size, hop size, and other parameters. What should I set these parameters to? I would like to know more details about how to extract the audio feature, thank you.

[Screenshot from 2022-06-06 21-58-33]

@Mofafa (Collaborator) commented Jun 10, 2022

We use the default settings in PANN. The sample rate is 32000 Hz.

Since the timeline is shared, the audio and video feature dimensions can be aligned according to time. You can get the fps of each original video and the frame index of each video clip from the dataset annotation. Then you can convert the frame index into a time index (e.g., fps = 25, clip frame index = [50, 100] → clip time index = [2, 4]). Similarly, you can derive the corresponding feature index from the time index.
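
For illustration, a minimal sketch of this conversion in Python (the `feat_per_sec` resolution here is an assumption, not a value from the project):

```python
# Convert a clip's video frame indices to time, then to audio feature rows.
# fps and clip_frames come from the dataset annotation; feat_per_sec is an
# assumed temporal resolution of the extracted audio feature.
fps = 25
clip_frames = (50, 100)          # clip boundaries in video frames
feat_per_sec = 1.0               # assumed audio feature frames per second

clip_times = (clip_frames[0] / fps,
              clip_frames[1] / fps)              # -> (2.0, 4.0) seconds

feat_idx = (int(clip_times[0] * feat_per_sec),
            int(clip_times[1] * feat_per_sec))   # -> (2, 4) feature rows
```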

@G-Apple1 commented

For the audio feature, is the code `librosa.core.load(offset=0/1/2, duration=2/3/4)`, corresponding to the video clip indices [0, 50] / [25, 75] / [50, 100]?

@Mofafa (Collaborator) commented Jun 13, 2022

We first extract the audio features from the whole video using the default settings in PANN. Then we calculate the feature indices corresponding to each clip and average all of the features that fall within each clip.
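
A minimal NumPy sketch of this whole-video-then-average procedure, using dummy data in place of the actual PANN output and an assumed feature resolution (`feat_per_sec`):

```python
import numpy as np

# audio_feat: features extracted from the WHOLE video's audio with PANN,
# shape (T, 2048), where T depends on the audio length and PANN's settings.
audio_feat = np.random.randn(40, 2048)   # dummy stand-in for PANN output
feat_per_sec = 4.0                       # assumed feature frames per second
fps = 25

# Clip boundaries in video frames (e.g. overlapping clips from annotation).
clips = [(0, 50), (25, 75), (50, 100)]

clip_feats = []
for start_f, end_f in clips:
    # Frame index -> time (s) -> audio feature row index.
    s = int(start_f / fps * feat_per_sec)
    e = int(end_f / fps * feat_per_sec)
    # Average all audio feature rows that fall inside this clip.
    clip_feats.append(audio_feat[s:e].mean(axis=0))

clip_feats = np.stack(clip_feats)        # shape (num_clips, 2048)
```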

@G-Apple1 commented

Can you give a specific example?

@yeliudev (Member) commented

Please refer to #18.

@G-Apple1 commented

I read the answer in #18 and it seems a bit inconsistent with the above. #18 says the features are time-aligned. My understanding of that answer is that the audio is first segmented by time and each segment is then fed into the network to extract features. Here, the entire audio is fed into the network first, and the features are then split.

If there is a 10 s audio clip divided into 5 parts at 2-second intervals, the extracted features would be [5, 2048]; but if the entire audio is fed into the network, the feature is [11, 2048]. How do these features correspond?

@yeliudev (Member) commented

Do you mean the audio feature for the whole 10 s video is [11, 2048]? In this case, you can still align it temporally with the video. Since this is a short video, using [2, 2048] for each video clip (resulting in 5 × [2, 2048] audio features) and dropping the last feature frame should be fine.
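
For example, one way to realize this with NumPy (a sketch of the alignment described above, using dummy data):

```python
import numpy as np

audio_feat = np.random.randn(11, 2048)   # whole-audio feature for the 10 s video
num_clips = 5                            # 5 clips of 2 seconds each

# Drop the last row so the 11 feature frames become 10 (1 frame per second),
# then take [2, 2048] (two 1-second frames) for each 2-second clip.
aligned = audio_feat[:-1].reshape(num_clips, 2, 2048)
print(aligned.shape)                     # (5, 2, 2048)
```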
