
How to align the audio feature and video feature? #17

Closed · Xuguozi opened this issue Jun 6, 2022 · 7 comments

Comments

@Xuguozi commented Jun 6, 2022

If the size of the video feature is [14, 2048], I need to extract an audio feature of the same size, [14, 2048].

Following your approach, I used the PANN_inference project to extract audio features from the raw wave file. Because of the video clipping and overlap operations, the first dimension of the video feature is 14. How do I align the audio feature with the video feature?

I found that the size of the audio feature depends on the sample rate, window size, hop size, and other parameters. What should I set these parameters to? I would like to know more details about how to extract the audio feature, thank you.

[Screenshot from 2022-06-06 21-58-33]

@Mofafa (Collaborator) commented Jun 10, 2022

We use the default settings in PANN. The sample rate is 32000 Hz.

Since the timeline is shared, the audio and video feature dimensions can be aligned according to time. You can get the fps of each original video and the frame index of each video clip from the dataset annotation. Then you can convert the frame index into a time index (e.g., fps = 25, clip frame index = [50, 100] → clip time index = [2, 4]). Similarly, you can derive the corresponding feature index from the time index.
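
For illustration, a minimal sketch of this conversion in Python (the `feat_per_sec` resolution here is an assumption, not a value from the project):

```python
# Convert a clip's video frame indices to time, then to audio feature rows.
# fps and clip_frames come from the dataset annotation; feat_per_sec is an
# assumed temporal resolution of the extracted audio feature.
fps = 25
clip_frames = (50, 100)          # clip boundaries in video frames
feat_per_sec = 1.0               # assumed audio feature frames per second

clip_times = (clip_frames[0] / fps,
              clip_frames[1] / fps)              # -> (2.0, 4.0) seconds

feat_idx = (int(clip_times[0] * feat_per_sec),
            int(clip_times[1] * feat_per_sec))   # -> (2, 4) feature rows
```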

@G-Apple1 commented

For the audio feature, is the code `librosa.core.load(offset=0/1/2, duration=2/3/4)`, corresponding to the video clip indices [0, 50] / [25, 75] / [50, 100]?

@Mofafa (Collaborator) commented Jun 13, 2022

We first extract the audio features from the whole video using the default settings in PANN. Then we calculate the feature indices corresponding to each clip and average all of the features that fall within each clip.
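
A minimal NumPy sketch of this whole-video-then-average procedure, using dummy data in place of the actual PANN output and an assumed feature resolution (`feat_per_sec`):

```python
import numpy as np

# audio_feat: features extracted from the WHOLE video's audio with PANN,
# shape (T, 2048), where T depends on the audio length and PANN's settings.
audio_feat = np.random.randn(40, 2048)   # dummy stand-in for PANN output
feat_per_sec = 4.0                       # assumed feature frames per second
fps = 25

# Clip boundaries in video frames (e.g. overlapping clips from annotation).
clips = [(0, 50), (25, 75), (50, 100)]

clip_feats = []
for start_f, end_f in clips:
    # Frame index -> time (s) -> audio feature row index.
    s = int(start_f / fps * feat_per_sec)
    e = int(end_f / fps * feat_per_sec)
    # Average all audio feature rows that fall inside this clip.
    clip_feats.append(audio_feat[s:e].mean(axis=0))

clip_feats = np.stack(clip_feats)        # shape (num_clips, 2048)
```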

@G-Apple1 commented

Can you give a specific example?

@yeliudev (Member) commented

Please refer to #18.

@G-Apple1 commented

I read the answer in #18 and it seems a bit inconsistent with the above. #18 says the features are time-aligned. My understanding of that answer is that the audio is first segmented by time and each segment is then fed into the network to extract features. Here, the entire audio is fed into the network first, and the features are then split.

If there is a 10 s audio clip divided into 5 parts at 2-second intervals, the extracted features would be [5, 2048]; but if the entire audio is fed into the network, the feature is [11, 2048]. How do these features correspond?

@yeliudev (Member) commented

Do you mean the audio feature for the whole 10 s video is [11, 2048]? In this case, you can still align it temporally with the video. Since this is a short video, using [2, 2048] for each video clip (resulting in 5 × [2, 2048] audio features) and dropping the last feature frame should be fine.
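
For example, one way to realize this with NumPy (a sketch of the alignment described above, using dummy data):

```python
import numpy as np

audio_feat = np.random.randn(11, 2048)   # whole-audio feature for the 10 s video
num_clips = 5                            # 5 clips of 2 seconds each

# Drop the last row so the 11 feature frames become 10 (1 frame per second),
# then take [2, 2048] (two 1-second frames) for each 2-second clip.
aligned = audio_feat[:-1].reshape(num_clips, 2, 2048)
print(aligned.shape)                     # (5, 2, 2048)
```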
