
Some problems about the paper #2

Closed
xiaowen0110 opened this issue Jan 2, 2021 · 19 comments
@xiaowen0110

xiaowen0110 commented Jan 2, 2021

Your project is very awesome! I am trying to reproduce your method myself, but I have some confusions. Can you give me some suggestions?
In a word, I'm not sure about the output shapes of some modules. Following your description, and given an input tensor of shape 1x3x256x256, I get fs with shape 1x32x16x64x64, and the output of the last UpBlock3D of L is 1x32x512x256x256 (which is very expensive when computing Jc,k). I put a small shape-tracing sketch of my current understanding after the list below.

1. Appearance feature extractor F: this one is simple, but I want to confirm that the output (called fs) has shape 1x32x16x64x64 (the input shape is 1x3x256x256). The warped fs is fed into the motion field estimator M, but that has 5 DownBlock3D layers; with D=16, only 4 down-blocks are needed to reduce D to 1, so why do we need 5?
2. The occlusion path in the motion field estimator M: why is there a Reshape C137*D16 -> C2192? The output of the last UpBlock3D has 32 channels, so what is D there? 137x16/32 = 68.5, and I think D should be 16, the same as for fs.
3. The mask path in the motion field estimator M: there is a 7x7x7-Conv-21, but K is 20, so why is C 21? And does it need a global pooling? Is the mask a 20-d vector that is simply multiplied with every pixel of Wk?
4. I want to confirm the behavior of the 3D blocks such as UpBlock3D: do they double D, just as they do for H and W?
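
Here is the minimal shape-tracing sketch mentioned above. The layer choices are my own guess (not the official code); the only assumption I rely on is that 2D features at 64x64 with 512 channels are reshaped into a C=32, D=16 volume:

```python
import torch
import torch.nn as nn

# My guess at F's shape flow (not the official architecture): a 2D encoder
# downsamples 256x256 -> 64x64, and the 512 output channels are then split
# into C=32 channels x D=16 depth slices to form the 3D volume fs.
encoder_2d = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, padding=3),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),   # 256 -> 128
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),  # 128 -> 64
    nn.Conv2d(256, 512, kernel_size=1),                       # 512 = 32 * 16
)

x = torch.randn(1, 3, 256, 256)
feat_2d = encoder_2d(x)                # (1, 512, 64, 64)
fs = feat_2d.view(1, 32, 16, 64, 64)   # (1, C=32, D=16, H=64, W=64)
print(fs.shape)
```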

@blues-green

Hello, have you reimplemented the project?

@AY1997

AY1997 commented Jan 18, 2021

I guess 137 is a typing error.

@deepkyu

deepkyu commented May 6, 2021

Hasn't this been solved yet? I'm also curious about the number 137... 👍🏻

@zhanglonghao1992

zhanglonghao1992 commented Jun 4, 2021

I have re-implemented this paper, and the experiment has achieved initial results.

  1. I don't think D should change with downsampling or upsampling.
  2. I really don't understand "137".
  3. C is set to "21" to handle the background. The mask corresponding to the background is very important, although it is not mentioned in the paper. For this part, you can refer to the FOMM paper (a sketch of this is below).

Here is a free-view talking-head demo (animation with different roll, yaw, and pitch) from my implementation:

concat-14.mp4
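
A minimal sketch of how that background mask could look, assuming it follows FOMM's K+1 mask convention (this is my reading, not the released code; the 137 input channels are simply taken from the figure being discussed):

```python
import torch
import torch.nn as nn

# Assumed mask head: predict K+1 = 21 channels (20 keypoints + 1 background)
# and apply a per-voxel softmax over channels so the masks sum to 1 everywhere.
# No global pooling is involved; the mask stays a dense volume.
K = 20
unet_out_channels = 137                      # channel count taken from the figure
mask_head = nn.Conv3d(unet_out_channels, K + 1, kernel_size=7, padding=3)

u = torch.randn(1, unet_out_channels, 16, 64, 64)
mask = torch.softmax(mask_head(u), dim=1)    # (1, 21, 16, 64, 64)
# The final dense flow is a per-voxel convex combination of the K+1 candidate flows w_k.
```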

@xiaowen0110
Author

Thanks a lot. Do you have plans to release your code? Are your loss weights similar to the paper's? My re-implementation cannot generate good results; it produces broken images. @zhanglonghao1992

@zhanglonghao1992

zhanglonghao1992 commented Jun 15, 2021


Right now, it seems like the model does work, but there are still a lot of problems to be solved.
For example, the opening and closing of the eyes and the mouth shape are not consistent with the driving video, which I think is caused by an inaccurate occlusion mask:

Training my model (for visualization, the 3D keypoints are projected to 2D):
[screenshot]
Training FOMM:
[screenshot]

In addition, I found that the detected 3D keypoints were not accurate or even reasonable.
I might release my code after I solve these problems. By the way, you can contact me by email (longhao.zlh@alibaba-inc.com), and we can discuss your problems in detail.

@zhanglonghao1992

Update:
It seems that the problem of the eyes opening and closing has been solved.
[screenshot]

I'll release the code later. @XiaoWen-AI @charan223

@tcwang0509
Contributor

Nice to see the reimplementation! Sorry for the delay in the code release; we are still in the long, tedious process of getting company approval... hopefully it can be approved soon.
Regarding the number 137: we first compress the input 3D source features to 4 channels and then warp them according to each keypoint's motion. Then, together with the heatmap, we have 5 channels per keypoint. We have 20 keypoints, plus 1 that adopts the identity transformation (i.e., no warping), so 5x(20+1) = 105 channels. This is the input to the U-Net. The last layer of the U-Net has 32 channels, so 105 + 32 = 137.
Hope this helps, and please let me know if you have any other questions!
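
For anyone following along, here is a small sketch of that channel bookkeeping. It is only my illustration of the arithmetic above; the tensors are dummies, and the shapes follow the thread's 1x32x16x64x64 convention for fs:

```python
import torch

B, D, H, W = 1, 16, 64, 64
K = 20                                    # estimated keypoints
num_motions = K + 1                       # +1 identity (no-warp) motion

fs_compressed = torch.randn(B, 4, D, H, W)                       # fs compressed from 32 to 4 channels
warped = [fs_compressed.clone() for _ in range(num_motions)]     # one warped copy per motion (identity here)
heatmaps = [torch.randn(B, 1, D, H, W) for _ in range(num_motions)]

# 4 warped-feature channels + 1 heatmap channel = 5 channels per motion
unet_input = torch.cat([torch.cat([w, h], dim=1) for w, h in zip(warped, heatmaps)], dim=1)
print(unet_input.shape)                   # (1, 105, 16, 64, 64), since 5 * 21 = 105

unet_last = torch.randn(B, 32, D, H, W)   # output of the last U-Net layer
features_137 = torch.cat([unet_input, unet_last], dim=1)
print(features_137.shape)                 # (1, 137, 16, 64, 64), since 105 + 32 = 137
```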

@zhanglonghao1992


Thank you very much for your explanation of "137". Can you share more details about the feature compression, such as how many 3D convolution layers you use and what the kernel size is?
Besides, I find that the last two 7x7x7 convolution layers of the canonical keypoint detector consume too much computation and memory (when the input image resolution is 256x256). Should the input image be downsampled first? If so, what scale factor?
I train the model on the VoxCeleb dataset and set the scale factor of the canonical keypoint detector to 0.25. Training goes completely out of control after several epochs (the head-pose loss suddenly rises) when the number of keypoints is set to 20, but when I set the number of keypoints to 10, training is stable.

@tcwang0509
Contributor

  1. It's just a very simple 1x1x1 conv layer.
  2. In our further experiments, we tried replacing the kernel with 3x3x3 and observed no reduction in quality. The Jacobian part is also removed in the new version.
  3. On VoxCeleb I've tried 15 keypoints and it's almost as good as 20. I haven't tried 10, though.
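
In code, point 1 amounts to something like the following (a minimal sketch under the assumption that fs has 32 channels, as discussed earlier in the thread; not the released implementation):

```python
import torch
import torch.nn as nn

# Assumed 1x1x1 compression of the 32-channel source volume fs down to 4 channels,
# applied before fs is warped once per keypoint motion.
compress_fs = nn.Conv3d(in_channels=32, out_channels=4, kernel_size=1)

fs = torch.randn(1, 32, 16, 64, 64)
print(compress_fs(fs).shape)   # torch.Size([1, 4, 16, 64, 64])
```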

@zhengkw18

zhengkw18 commented Jul 6, 2021


Thank you very much for the explanation of "137"! I'm also starting to re-implement this paper, but I have some confusion about the architecture figure, part of which has already been raised in earlier comments:

  1. Should 3D upsampling and downsampling affect D?
  2. I think the "ResBottleneck" structure explained in the figure can't handle the change in the number of channels in the head pose estimator. Is it okay to just follow the ResNet-50 architecture?
  3. Besides, since the input spatial size of ResNet-50 is 224x224, a 7x7 average pool at the end is enough to remove the output spatial dimensions. But when the input is 256x256, the spatial size at the end is 8x8; should I use an 8x8 average pool?
  4. How is wk(fs) concatenated to form the input of the U-Net in the motion field estimator? You explained where 137 comes from, but how is fs compressed to 4 channels? Also, the spatial size of fs is 16x64x64 while the spatial size of the heatmap is 16x256x256 (if D is not doubled by upsampling), so how are they concatenated to form 5 channels?

@xiaowen0110
Author

Thanks to your advice, my reimplementation seems to work now. But the generated occlusion_map is very different from yours, and I have the same problem: the opening and closing of the eyes and mouth does not work. What revisions did you make to fix it? In addition, the keypoints are very dense and carry no semantics. I will try to solve these problems in my free time. Thanks again! @zhanglonghao1992
[screenshot]

@charan223

I have a similar question to @zhengkw18's. @tcwang0509

Can you explain how the input of the U-Net in the motion field estimator is formed?

  1. Which heatmap are we concatenating?
  2. How is fs compressed into 4 channels?
  3. The size of fs is 16x64x64; if the heatmap is the output of the canonical keypoint detector, it would be 16x256x256. How are we downsampling it?

@zhengkw18

zhengkw18 commented Jul 15, 2021


Well, I read the FOMM code and found that it answers most of these questions. This paper is similar to FOMM in many respects.

  1. In the canonical keypoint detector, the output is turned into a heatmap with a softmax, and the heatmap gives the keypoints by weighting the grid coordinates.
  2. The author says it's a 1x1x1 3D conv.
  3. In fact, the heatmap concatenated to the compressed fs is not the original heatmap but one re-generated from the keypoints with a Gaussian. The re-generation works at any spatial shape, so we can set it to 16x64x64 (see the sketch below).
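
A minimal FOMM-style sketch of that re-generation (extending FOMM's kp2gaussian idea to 3D is my own assumption, and kp_variance here is just an illustrative value):

```python
import torch

def kp2gaussian_3d(kp, spatial_size, kp_variance=0.01):
    """kp: (B, K, 3) keypoint coordinates in [-1, 1]; returns heatmaps of shape (B, K, D, H, W)."""
    D, H, W = spatial_size
    # Build a normalized (D, H, W, 3) coordinate grid in [-1, 1].
    zs = torch.linspace(-1, 1, D)
    ys = torch.linspace(-1, 1, H)
    xs = torch.linspace(-1, 1, W)
    grid = torch.stack(torch.meshgrid(zs, ys, xs, indexing="ij"), dim=-1)  # (D, H, W, 3)
    diff = grid[None, None] - kp[:, :, None, None, None, :]                # (B, K, D, H, W, 3)
    return torch.exp(-0.5 * (diff ** 2).sum(-1) / kp_variance)

kp = torch.rand(1, 20, 3) * 2 - 1                       # dummy keypoints in [-1, 1]^3
heatmap = kp2gaussian_3d(kp, spatial_size=(16, 64, 64))
print(heatmap.shape)                                    # torch.Size([1, 20, 16, 64, 64])
```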

I'm implementing this paper these days to see if my understanding holds.

Besides, I wonder whether the authors use some special method to model the motion of the eyes and mouth. In the provided demo, we can even control eye rotation; I don't think that is mentioned in the paper.

@mingyuliutw
Contributor

FOMM is definitely a good reference repo. I recommend that everybody who wants to work on motion transfer read the FOMM paper. Here, I want to point out several major differences between our model and FOMM to help reproduce the results.

  1. We do not estimate Jacobians for the keypoints, which are the first-order information that FOMM (first-order motion model) depends on. We can get away without Jacobian estimation because of a rigid-head assumption. Since we do not need Jacobians, we only need to compress the 3D keypoints, which gives us an advantage for low-bit-rate video conferencing calls.
  2. Our keypoints are 3D and follow a decomposition form; FOMM's keypoints are 2D and do not. The decomposition allows an intuitive way to change the head pose, which is the face-redirection feature presented in the paper.
  3. We found that using a GAN discriminator improves the rendering realism for our model, whereas FOMM found the GAN discriminator not useful.

@zhanglonghao1992

@XiaoWen-AI
Just follow @tcwang0509's advice. I've made some progress on keypoints and expressions.

@zhengkw18

[screenshots]
I made some progress these days and got barely satisfactory results (left to right: source, warped driving, driving, reconstructed).
Still, I think my implementation or my training is far from satisfactory.

  1. I train the model with about 1,000,000 (100W) random pairs of images from VoxCeleb1. It took 5 days on four 2080 Ti GPUs, and only because I dropped the pretrained VGG19 part to save memory; otherwise, one GPU can hold only one sample per batch.
  2. The model reconstructs faces with different rolls well, but for images with a large difference in yaw, the reconstruction is sometimes broken.
  3. The estimated keypoints for different yaws are not consistent or reasonable (for a large yaw, the keypoints often concentrate on half of the face, or even fall outside the face). Also, the keypoints of different people with different poses are not consistent.
  4. So far, expressions are not reconstructed.
  5. The occlusion map seems to have no effect (all values are close to 1).

I will continue training to see if the results improve. It would be a great help if anyone has suggestions for these problems.

@charan223

charan223 commented Jul 25, 2021

Hello,
What kind of input video preprocessing are you doing? Are you doing any landmark alignment to make the images 256x256?
Can you point me to any codebases/references?
And how long does it take to process the entire VoxCeleb2 dataset?

@zbdehh

zbdehh commented Jul 27, 2021


Hi @zhengkw18, we encountered the same problem: the reconstructed face has no expression.
