How much time do you need to lip sync a 10 sec or 1 minute video? #584

Open
AIhasArrived opened this issue Nov 10, 2023 · 14 comments

@AIhasArrived

Over the last few days I have been trying both wav2lip HD (not in auto) and retalker, and found that both are slow and very GPU-intensive.
I would like to hear from everyone: which GPU (what card) do you use, and how long does it take you? What kind of videos/animations are you lip syncing, and how long are they? (How much time does it take to process X seconds/minutes?)

Please contribute. I am about to give up on this technology, and maybe other people's experiences will give me hope. Maybe this repo is faster? (I could not try it yet.)

@sahreen-haider

Well, I have only tried this repository on Colab, and it seems fine for videos up to about 25-30 seconds; anything longer is going to take a lot of time.
The free version of Colab gives you an Nvidia K80/T4 GPU with 12-16 GB of GPU RAM, which TBH is fine for a free tier.
For your reference, lip syncing a 1-minute video takes roughly 4 to 5 minutes.
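
To turn the figures above into a rough estimate for other clip lengths, here is a minimal sketch. It assumes processing time scales linearly with clip length, which this thread does not confirm (longer clips may well scale worse), and the function names are illustrative only.

```python
def slowdown(video_seconds: float, processing_seconds: float) -> float:
    """Return how many seconds of processing one second of video costs."""
    if video_seconds <= 0:
        raise ValueError("video_seconds must be positive")
    return processing_seconds / video_seconds


def estimate_processing_seconds(video_seconds: float, factor: float) -> float:
    """Estimate processing time, ASSUMING linear scaling with clip length."""
    return video_seconds * factor


# Figures reported above: a 1-minute video takes about 4 to 5 minutes on free Colab.
low = slowdown(60, 4 * 60)
high = slowdown(60, 5 * 60)

for clip in (10, 60):
    low_estimate = estimate_processing_seconds(clip, low)
    high_estimate = estimate_processing_seconds(clip, high)
    print(f"{clip} s clip: roughly {low_estimate:.0f}-{high_estimate:.0f} s of processing")
```

Under that linear assumption, a 10-second clip on the same hardware would land somewhere around 40 to 50 seconds; treat it as a ballpark, not a measurement.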

If you want to improve the model's performance, here are some things you can try:

  • Use a lighter alternative face detection model. The one used here is 'sfd', which is taken from another face detection library; that library also offers "dlib", which is faster than "sfd": https://github.com/1adrianb/face-alignment.git
  • Check that your video doesn't have many cuts.
  • Try lower-resolution videos, since the model itself was trained on 720p videos.

@davidkundrats

I have been running this model on 1080p input videos between 10 and 30 seconds long on my machine (RTX 3060, 12 GB VRAM), and I have had to set the --rescale argument for inference.py to 3 to avoid running out of memory. Generating a lip-synced clip takes a little over a minute. I also had to modify the code for the preprocessing and discriminator training scripts in order to run them locally on my machine.

If you want to get this working on your machine, I would suggest using the environment setup described here: https://github.com/natlamir/Wav2Lip-WebUI
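
For a sense of what a rescale value of 3 does to a 1080p input, here is a small sketch. It assumes the factor is applied as an integer division of frame width and height; the exact flag name may differ between forks (the upstream inference.py appears to expose it as --resize_factor, but treat that as an assumption). `downscaled_size` is a hypothetical helper, not part of the repository.

```python
from typing import Tuple


def downscaled_size(width: int, height: int, factor: int) -> Tuple[int, int]:
    """Return the frame size after dividing both dimensions by an integer factor."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return width // factor, height // factor


# A 1080p frame with the factor of 3 mentioned above.
new_width, new_height = downscaled_size(1920, 1080, 3)
pixel_ratio = (1920 * 1080) / (new_width * new_height)
print(f"{new_width}x{new_height}, {pixel_ratio:.1f}x fewer pixels per frame")
```

Because the pixel count drops with the square of the factor, even a small factor cuts the per-frame work considerably, which is consistent with it avoiding the out-of-memory errors on a 12 GB card.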

@AIhasArrived
Author

@sahreen-haider Thank you, I will check these out.

@AIhasArrived
Author

@davidkundrats OK, thanks, I will check it out. I might contact you again if need be.

@sahreen-haider

Sure

@AIhasArrived
Author

Hello again @sahreen-haider,
but how do I change the model used for face detection? That requires quite a bit of coding, no?

@sahreen-haider

Hello,

The model can be swapped for a pretrained face detection model from another library,
and yes, that will require a bit of coding.
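
One way to keep that coding contained is to hide the detector behind a small interface, so that swapping 'sfd' for a lighter model touches a single place. The sketch below is hypothetical: the `FaceDetector` protocol, the `DETECTORS` registry, and the dummy detector are illustrative names, not this repository's API, and a real detector (for example one from the face-alignment library) would need its own wrapper class.

```python
from typing import Callable, Dict, List, Protocol, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2)
Frame = Sequence[Sequence[int]]   # stand-in for an image array: rows of pixels


class FaceDetector(Protocol):
    """Hypothetical interface: one bounding box per input frame."""

    def detect(self, frames: List[Frame]) -> List[Box]: ...


class CenterBoxDetector:
    """Dummy detector that returns the central half of each frame as the 'face'."""

    def detect(self, frames: List[Frame]) -> List[Box]:
        boxes = []
        for frame in frames:
            height, width = len(frame), len(frame[0])
            boxes.append((width // 4, height // 4, 3 * width // 4, 3 * height // 4))
        return boxes


# Registry mapping a name to a detector factory; a real wrapper would be added here.
DETECTORS: Dict[str, Callable[[], FaceDetector]] = {"center": CenterBoxDetector}


def build_detector(name: str) -> FaceDetector:
    """Look up a detector by name so the rest of the pipeline never hard-codes one."""
    try:
        return DETECTORS[name]()
    except KeyError:
        raise ValueError(f"unknown detector {name!r}; choose from {sorted(DETECTORS)}") from None


frames = [[[0] * 8 for _ in range(4)]]  # one blank frame, 8 pixels wide and 4 high
print(build_detector("center").detect(frames))
```

With this shape, trying a different detector means writing one wrapper class and registering it, rather than editing the inference loop.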

@AIhasArrived
Author

Is it possible to get help with that? (Maybe send me the modified version by PM if you would rather it not spread too widely; I will only use it myself.)
I just want a tool that does good lip sync. I have a nice GPU and would like to see if I can get some good results.
Or maybe point me to other/better/different tools I could try. It's frustrating; I wish I could find the right tool.

@sahreen-haider

Hey @AIhasArrived,
I know it can be a little difficult to get good results from the model, since it may require some fine-tuning and parameter tweaking. You might also have to change some of the code for the underlying libraries, such as face detection, and for the GAN (if you are aiming for higher-definition output).

This grunt work would take me significant time, though, and unfortunately I might not be able to do it right now.

Since you asked about alternatives:
https://www.sievedata.com/functions/sieve/video_retalking
The URL above was posted by someone else; it might be a possible alternative solution for your problem.
It is not Wav2Lip, but the issue stated that this alternative could produce much better results than the existing library.

You might want to check it out.

@sahreen-haider

@AIhasArrived, Connect with me over this email: sahreenhaider@gmail.com

@AIhasArrived
Author

Already did: I sent you an email a few days ago titled "Contact from github :)"

@Manda69-bit

Manda69-bit commented Nov 26, 2023

I can sync an 8-second video in about 15 seconds, and the time could improve with better parameters. But when I started, it was about 4x slower, and I realized something was just wrong: the starting chunks were loading really slowly. After doing some research, I found that the problem was a newer Torch version not working properly with the GPU. Following other topics, I tried an older version, e.g. "torch==2.0.1+cu118", and my chunk loading speed increased drastically. Hope it helps, and I hope they fix this shit in a new version.
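
To check which build is actually installed before and after pinning, a minimal sketch follows. The parser only splits a version string such as the one quoted above; `split_torch_version` is a hypothetical helper, and the live check is skipped when torch is not installed.

```python
from typing import Optional, Tuple


def split_torch_version(version: str) -> Tuple[str, Optional[str]]:
    """Split '2.0.1+cu118' into ('2.0.1', 'cu118'); the tag is None if absent."""
    base, separator, tag = version.partition("+")
    return base, (tag if separator else None)


print(split_torch_version("2.0.1+cu118"))

try:
    import torch
except ImportError:
    print("torch is not installed in this environment")
else:
    # Report the installed build and whether it can see a CUDA device.
    print("installed:", split_torch_version(torch.__version__))
    print("CUDA available:", torch.cuda.is_available())
```

If the CUDA check prints False, or the tag is missing or says "cpu", the slowdown may simply be inference running on the CPU rather than anything specific to the newer release.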

@AIhasArrived
Author

Hello @davidkundrats, I just tried the Wav2Lip-WebUI repo you suggested. It looks nice, but when I run it I hit a problem: nothing happens while the GPU is being used. Did you run into that yourself? If so, what did you do to solve it? Thanks.

5 participants
@sahreen-haider @davidkundrats @Manda69-bit @AIhasArrived and others