
why embedding the audio features #58

Open
e4s2022 opened this issue Jun 29, 2022 · 0 comments

Hi, thanks for sharing this great work!

I understand the main pipeline, i.e., encoding the speech content features, identity features, and pose features separately, and then feeding them to the generator to produce the driven results. But I am a little confused after reading the inference code.
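For context, here is how I picture that three-branch pipeline. This is only a toy sketch with invented module names and dimensions, not the actual model code:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the three encoders and the generator (all dimensions are made up).
audio_encoder = nn.Linear(80, 64)          # mel-spectrogram -> speech-content (mouth) feature
id_encoder = nn.Linear(128, 64)            # reference frame  -> identity feature
pose_encoder = nn.Linear(128, 32)          # reference frame  -> head-pose feature
generator = nn.Linear(64 + 64 + 32, 256)   # fused features   -> output image feature

spectrogram = torch.randn(1, 80)
ref_frame = torch.randn(1, 128)

mouth_feat = audio_encoder(spectrogram)    # speech content comes from the audio
id_feat = id_encoder(ref_frame)            # identity comes from the reference video
pose_feat = pose_encoder(ref_frame)        # head pose comes from the reference video

fused = torch.cat([id_feat, mouth_feat, pose_feat], dim=1)
out = generator(fused)
print(out.shape)  # torch.Size([1, 256])
```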

A_mouth_feature = self.encode_audiosync_feature(spectrogram)
A_mouth_feature = A_mouth_feature * mouth_feature_weight
sel_id_feature = []
sel_id_feature.append(self.select_frames(id_feature[0]))
sel_id_feature.append(self.select_frames(id_feature[1]))
V_noid_ref_feature = self.encode_ref_noid(input_img)
V_headpose_ref_feature = self.netE.to_headpose(V_noid_ref_feature)
ref_merge_feature_a = self.select_frames(self.merge_mouthpose(A_mouth_feature, V_headpose_ref_feature))
fake_image_ref_pose_a, _ = self.generate_fake(sel_id_feature, ref_merge_feature_a)

As can be seen, the mel-spectrogram is first encoded by the audio encoder (Line 473) and is then ready to be fused with the pose feature (Line 483). However, in the merge_mouthpose() function:

def merge_mouthpose(self, mouth_feature, headpose_feature, embed_headpose=False):
    mouth_feature = self.netE.mouth_embed(mouth_feature)
    if not embed_headpose:
        headpose_feature = self.netE.headpose_embed(headpose_feature)
    pose_feature = torch.cat((mouth_feature, headpose_feature), dim=2)
    return pose_feature

I found that the audio features are further embedded here; what is the intuition behind that? In my view, netE.mouth_embed would be used to embed the mouth features extracted from the video, NOT from the audio. If I have misunderstood anything, please correct me. Thanks in advance.
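For what it's worth, my current guess (purely an assumption on my part, not something confirmed by the paper or the authors) is that mouth_embed acts as a shared projection applied to mouth features from either modality, so that audio-derived and video-derived mouth features land in one common latent space and become interchangeable inputs to the generator. A toy sketch of that idea, with hypothetical encoders and dimensions:

```python
import torch
import torch.nn as nn

feat_dim, embed_dim = 64, 32

# Hypothetical per-modality encoders that produce mouth features in the same raw space.
audio_mouth_encoder = nn.Linear(80, feat_dim)    # from mel-spectrogram
video_mouth_encoder = nn.Linear(128, feat_dim)   # from video frames

# One shared embedding (analogous to netE.mouth_embed) applied to BOTH modalities,
# so everything downstream sees a single, modality-agnostic mouth representation.
mouth_embed = nn.Linear(feat_dim, embed_dim)

a_mouth = mouth_embed(audio_mouth_encoder(torch.randn(1, 80)))
v_mouth = mouth_embed(video_mouth_encoder(torch.randn(1, 128)))

# Both modalities end up with the same embedded shape and can be swapped freely.
assert a_mouth.shape == v_mouth.shape == (1, embed_dim)
```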
