
Some problems about the paper #2

Closed
xiaowen0110 opened this issue Jan 2, 2021 · 19 comments
@xiaowen0110

xiaowen0110 commented Jan 2, 2021

Your project is very awesome! I am trying to reproduce your method myself, but I have some confusions. Can you give me some suggestions?
In a word, I'm not sure about the output shapes of some modules. Following your description, and given an input tensor of shape 1x3x256x256, I get fs with shape 1x32x16x64x64, and the output of the last UpBlock3D of L is 1x32x512x256x256 (which is very expensive when computing Jc,k). I put a small shape-tracing sketch of my current understanding after the list below.

1. Appearance feature extractor F: this one is simple, but I want to confirm that the output (called fs) has shape 1x32x16x64x64 (the input shape is 1x3x256x256). The warped fs is fed into the motion field estimator M, but that has 5 DownBlock3D layers; with D=16, only 4 down-blocks are needed to reduce D to 1, so why do we need 5?
2. The occlusion path in the motion field estimator M: why is there a Reshape C137*D16 -> C2192? The output of the last UpBlock3D has 32 channels, so what is D there? 137x16/32 = 68.5, and I think D should be 16, the same as for fs.
3. The mask path in the motion field estimator M: there is a 7x7x7-Conv-21, but K is 20, so why is C 21? And does it need a global pooling? Is the mask a 20-d vector that is simply multiplied with every pixel of Wk?
4. I want to confirm the behavior of the 3D blocks such as UpBlock3D: do they double D, just as they do for H and W?
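
Here is the minimal shape-tracing sketch mentioned above. The layer choices are my own guess (not the official code); the only assumption I rely on is that 2D features at 64x64 with 512 channels are reshaped into a C=32, D=16 volume:

```python
import torch
import torch.nn as nn

# My guess at F's shape flow (not the official architecture): a 2D encoder
# downsamples 256x256 -> 64x64, and the 512 output channels are then split
# into C=32 channels x D=16 depth slices to form the 3D volume fs.
encoder_2d = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, padding=3),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),   # 256 -> 128
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),  # 128 -> 64
    nn.Conv2d(256, 512, kernel_size=1),                       # 512 = 32 * 16
)

x = torch.randn(1, 3, 256, 256)
feat_2d = encoder_2d(x)                # (1, 512, 64, 64)
fs = feat_2d.view(1, 32, 16, 64, 64)   # (1, C=32, D=16, H=64, W=64)
print(fs.shape)
```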

@blues-green

Hello, have you reimplemented the project?

@AY1997

AY1997 commented Jan 18, 2021

I guess 137 is a typing error.

@deepkyu

deepkyu commented May 6, 2021

Hasn't this been solved yet? I'm also curious about the number 137... 👍🏻

@zhanglonghao1992

zhanglonghao1992 commented Jun 4, 2021

I have re-implemented this paper, and the experiment has achieved initial results.

  1. I don't think D should change with downsampling or upsampling.
  2. I really don't understand "137".
  3. C is set to "21" to handle the background. The mask corresponding to the background is very important, although it is not mentioned in the paper. For this part, you can refer to the FOMM paper (a sketch of this is below).

Here is a free-view talking-head demo (animation with different roll, yaw, and pitch) from my implementation:

concat-14.mp4
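
A minimal sketch of how that background mask could look, assuming it follows FOMM's K+1 mask convention (this is my reading, not the released code; the 137 input channels are simply taken from the figure being discussed):

```python
import torch
import torch.nn as nn

# Assumed mask head: predict K+1 = 21 channels (20 keypoints + 1 background)
# and apply a per-voxel softmax over channels so the masks sum to 1 everywhere.
# No global pooling is involved; the mask stays a dense volume.
K = 20
unet_out_channels = 137                      # channel count taken from the figure
mask_head = nn.Conv3d(unet_out_channels, K + 1, kernel_size=7, padding=3)

u = torch.randn(1, unet_out_channels, 16, 64, 64)
mask = torch.softmax(mask_head(u), dim=1)    # (1, 21, 16, 64, 64)
# The final dense flow is a per-voxel convex combination of the K+1 candidate flows w_k.
```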

@xiaowen0110
Author

Thanks a lot. Do you have plans to release your code? Are your loss weights similar to the paper's? My re-implementation cannot generate good results; it produces broken images. @zhanglonghao1992

@zhanglonghao1992

zhanglonghao1992 commented Jun 15, 2021


Right now, it seems like the model does work, but there are still a lot of problems to be solved.
For example, the opening and closing of the eyes and the mouth shape are not consistent with the driving video, which I think is caused by an inaccurate occlusion mask:

Training my model (for visualization, the 3D keypoints are projected to 2D):
[screenshot]
Training FOMM:
[screenshot]

In addition, I found that the detected 3D keypoints were not accurate or even reasonable.
I might release my code after I solve these problems. By the way, you can contact me by email (longhao.zlh@alibaba-inc.com), and we can discuss your problems in detail.

@zhanglonghao1992

Update:
It seems that the problem of the eyes opening and closing has been solved.
[screenshot]

I'll release the code later. @XiaoWen-AI @charan223

@tcwang0509
Contributor

Nice to see the reimplementation! Sorry for the delay in the code release; we are still in the long, tedious process of getting company approval... hopefully it can be approved soon.
Regarding the number 137: we first compress the input 3D source features to 4 channels and then warp them according to each keypoint's motion. Then, together with the heatmap, we have 5 channels per keypoint. We have 20 keypoints, plus 1 that adopts the identity transformation (i.e., no warping), so 5x(20+1) = 105 channels. This is the input to the U-Net. The last layer of the U-Net has 32 channels, so 105 + 32 = 137.
Hope this helps, and please let me know if you have any other questions!
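
For anyone following along, here is a small sketch of that channel bookkeeping. It is only my illustration of the arithmetic above; the tensors are dummies, and the shapes follow the thread's 1x32x16x64x64 convention for fs:

```python
import torch

B, D, H, W = 1, 16, 64, 64
K = 20                                    # estimated keypoints
num_motions = K + 1                       # +1 identity (no-warp) motion

fs_compressed = torch.randn(B, 4, D, H, W)                       # fs compressed from 32 to 4 channels
warped = [fs_compressed.clone() for _ in range(num_motions)]     # one warped copy per motion (identity here)
heatmaps = [torch.randn(B, 1, D, H, W) for _ in range(num_motions)]

# 4 warped-feature channels + 1 heatmap channel = 5 channels per motion
unet_input = torch.cat([torch.cat([w, h], dim=1) for w, h in zip(warped, heatmaps)], dim=1)
print(unet_input.shape)                   # (1, 105, 16, 64, 64), since 5 * 21 = 105

unet_last = torch.randn(B, 32, D, H, W)   # output of the last U-Net layer
features_137 = torch.cat([unet_input, unet_last], dim=1)
print(features_137.shape)                 # (1, 137, 16, 64, 64), since 105 + 32 = 137
```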

@zhanglonghao1992


Thank you very much for your explanation of "137". Can you share more details about the feature compression, such as how many 3D convolution layers you use and what the kernel size is?
Besides, I find that the last two 7x7x7 convolution layers of the canonical keypoint detector consume too much computation and memory (when the input image resolution is 256x256). Should the input image be downsampled first? If so, what scale factor?
I train the model on the VoxCeleb dataset and set the scale factor of the canonical keypoint detector to 0.25. Training goes completely out of control after several epochs (the head-pose loss suddenly rises) when the number of keypoints is set to 20, but when I set the number of keypoints to 10, training is stable.

@tcwang0509
Contributor

  1. It's just a very simple 1x1x1 conv layer.
  2. In our further experiments, we tried replacing the kernel with 3x3x3 and observed no reduction in quality. The Jacobian part is also removed in the new version.
  3. On VoxCeleb I've tried 15 keypoints and it's almost as good as 20. I haven't tried 10, though.
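
In code, point 1 amounts to something like the following (a minimal sketch under the assumption that fs has 32 channels, as discussed earlier in the thread; not the released implementation):

```python
import torch
import torch.nn as nn

# Assumed 1x1x1 compression of the 32-channel source volume fs down to 4 channels,
# applied before fs is warped once per keypoint motion.
compress_fs = nn.Conv3d(in_channels=32, out_channels=4, kernel_size=1)

fs = torch.randn(1, 32, 16, 64, 64)
print(compress_fs(fs).shape)   # torch.Size([1, 4, 16, 64, 64])
```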

@zhengkw18

zhengkw18 commented Jul 6, 2021


Thank you very much for the explanation of "137"! I'm also starting to re-implement this paper, but I have some confusion about the architecture figure, part of which has already been raised in earlier comments:

  1. Should 3D upsampling and downsampling affect D?
  2. I think the "ResBottleneck" structure explained in the figure can't handle the change in the number of channels in the head pose estimator. Is it okay to just follow the ResNet-50 architecture?
  3. Besides, since the input spatial size of ResNet-50 is 224x224, a 7x7 average pool at the end is enough to remove the output spatial dimensions. But when the input is 256x256, the spatial size at the end is 8x8; should I use an 8x8 average pool?
  4. How is wk(fs) concatenated to form the input of the U-Net in the motion field estimator? You explained where 137 comes from, but how is fs compressed to 4 channels? Also, the spatial size of fs is 16x64x64 while the spatial size of the heatmap is 16x256x256 (if D is not doubled by upsampling), so how are they concatenated to form 5 channels?

@xiaowen0110
Author

Thanks to your advice, my reimplementation seems to work now. But the generated occlusion_map is very different from yours, and I have the same problem: the opening and closing of the eyes and mouth does not work. What revisions did you make to fix it? In addition, the keypoints are very dense and carry no semantics. I will try to solve these problems in my free time. Thanks again! @zhanglonghao1992
[screenshot]

@charan223

I have a similar question to @zhengkw18's. @tcwang0509

Can you explain how the input of the U-Net in the motion field estimator is formed?

  1. Which heatmap are we concatenating?
  2. How is fs compressed into 4 channels?
  3. The size of fs is 16x64x64; if the heatmap is the output of the canonical keypoint detector, it would be 16x256x256. How are we downsampling it?

@zhengkw18

zhengkw18 commented Jul 15, 2021


Well, I read the FOMM code and found that it answers most of these questions. This paper is similar to FOMM in many respects.

  1. In the canonical keypoint detector, the output is turned into a heatmap with a softmax, and the heatmap gives the keypoints by weighting the grid coordinates.
  2. The author says it's a 1x1x1 3D conv.
  3. In fact, the heatmap concatenated to the compressed fs is not the original heatmap but one re-generated from the keypoints with a Gaussian. The re-generation works at any spatial shape, so we can set it to 16x64x64 (see the sketch below).
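
A minimal FOMM-style sketch of that re-generation (extending FOMM's kp2gaussian idea to 3D is my own assumption, and kp_variance here is just an illustrative value):

```python
import torch

def kp2gaussian_3d(kp, spatial_size, kp_variance=0.01):
    """kp: (B, K, 3) keypoint coordinates in [-1, 1]; returns heatmaps of shape (B, K, D, H, W)."""
    D, H, W = spatial_size
    # Build a normalized (D, H, W, 3) coordinate grid in [-1, 1].
    zs = torch.linspace(-1, 1, D)
    ys = torch.linspace(-1, 1, H)
    xs = torch.linspace(-1, 1, W)
    grid = torch.stack(torch.meshgrid(zs, ys, xs, indexing="ij"), dim=-1)  # (D, H, W, 3)
    diff = grid[None, None] - kp[:, :, None, None, None, :]                # (B, K, D, H, W, 3)
    return torch.exp(-0.5 * (diff ** 2).sum(-1) / kp_variance)

kp = torch.rand(1, 20, 3) * 2 - 1                       # dummy keypoints in [-1, 1]^3
heatmap = kp2gaussian_3d(kp, spatial_size=(16, 64, 64))
print(heatmap.shape)                                    # torch.Size([1, 20, 16, 64, 64])
```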

I'm implementing this paper these days to see if my understanding holds.

Besides, I wonder whether the authors use some special method to model the motion of the eyes and mouth. In the provided demo, we can even control eye rotation; I don't think that is mentioned in the paper.

@mingyuliutw
Contributor

FOMM is definitely a good reference repo. I recommend that everybody who wants to work on motion transfer read the FOMM paper. Here, I want to point out several major differences between our model and FOMM to help reproduce the results.

  1. We do not estimate Jacobians for the keypoints, which are the first-order information that FOMM (first-order motion model) depends on. We can get away without Jacobian estimation because of a rigid-head assumption. Since we do not need Jacobians, we only need to compress the 3D keypoints, which gives us an advantage for low-bit-rate video conferencing calls.
  2. Our keypoints are 3D and follow a decomposition form; FOMM's keypoints are 2D and do not. The decomposition allows an intuitive way to change the head pose, which is the face-redirection feature presented in the paper.
  3. We found that using a GAN discriminator improves the rendering realism for our model, whereas FOMM found the GAN discriminator not useful.

@zhanglonghao1992

@XiaoWen-AI
Just follow @tcwang0509's advice. I've made some progress on keypoints and expressions.

@zhengkw18

[screenshots]
I made some progress these days and got barely satisfactory results (left to right: source, warped driving, driving, reconstructed).
Still, I think my implementation or my training is far from satisfactory.

  1. I train the model with about 1,000,000 (100W) random pairs of images from VoxCeleb1. It took 5 days on four 2080 Ti GPUs, and only because I dropped the pretrained VGG19 part to save memory; otherwise, one GPU can hold only one sample per batch.
  2. The model reconstructs faces with different rolls well, but for images with a large difference in yaw, the reconstruction is sometimes broken.
  3. The estimated keypoints for different yaws are not consistent or reasonable (for a large yaw, the keypoints often concentrate on half of the face, or even fall outside the face). Also, the keypoints of different people with different poses are not consistent.
  4. So far, expressions are not reconstructed.
  5. The occlusion map seems to have no effect (all values are close to 1).

I will continue training to see if the results improve. It would be a great help if anyone has suggestions for these problems.

@charan223

charan223 commented Jul 25, 2021

Hello,
What kind of input video preprocessing are you doing? Are you doing any landmark alignment to make the images 256x256?
Can you point me to any codebases/references?
And how long does it take to process the entire VoxCeleb2 dataset?

@zbdehh

zbdehh commented Jul 27, 2021


Hi @zhengkw18, we encountered the same problem: the reconstructed face has no expression.
