Lip2Wav

Generate high-quality speech from only lip movements. This code accompanies the paper Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis, published at CVPR 2020.

[Paper] | [Project Page] | [Demo Video]


Highlights

  • First work to generate intelligible speech from only lip movements in unconstrained settings.
  • First multi-speaker lip-to-speech generation results.
  • Complete training code and pretrained models made available.
  • Inference code to generate results from the pre-trained models.
  • Code to calculate metrics reported in the paper is also made available.

You might also be interested in:

🎉 Lip-sync talking face videos to any speech using Wav2Lip: https://github.com/Rudrabha/Wav2Lip

Prerequisites

  • Python 3.7.4 (code has been tested with this version)
  • ffmpeg: sudo apt-get install ffmpeg
  • Install necessary packages using pip install -r requirements.txt
  • Face detection pre-trained model should be downloaded to face_detection/detection/sfd/s3fd.pth
  • Speaker-embedding pre-trained model should be downloaded from this link (navigate to encoder/saved_models/pretrained.pt there) and saved locally to encoder/saved_models/pretrained.pt; a quick check for both model files is sketched below.
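Before moving on, it can help to confirm that both model files ended up where the scripts expect them. The check below is not part of the repository, just a small sketch using the paths listed above:

# quick sanity check that the pre-trained models are at the expected paths
# (helper snippet, not part of the repository; paths taken from the prerequisites above)
import os

for path in ("face_detection/detection/sfd/s3fd.pth",
             "encoder/saved_models/pretrained.pt"):
    print(path, "OK" if os.path.isfile(path) else "MISSING")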

Getting the weights

Download the weights of our model trained on the LRW dataset.

Preprocessing the LRW dataset

The LRW dataset is organized as follows.

data_root (lrw/ in the below examples)
├── word1
|	├── train, val, test (3 splits)
|	|    ├── *.mp4, *.txt
├── word2
|	├── ...
├── ...
Run the preprocessing script on the split you want to use:

python preprocess.py --data_root lrw/ --preprocessed_root lrw_preprocessed/ --split test

# dump speaker embeddings in the same preprocessed folder
python preprocess_speakers.py --preprocessed_root lrw_preprocessed/

Additional options such as the batch size, the number of GPUs and the split to use can also be set. After preprocessing, you should get the following structure (a quick way to inspect a single preprocessed sample is sketched after the tree):

data_root (lrw_preprocessed/ in the above example)
├── word1
|	├── train, val, test (preprocessed splits)
|	|    ├── word1_00001, word1_00002...
|	|    |    ├── *.jpg, mels.npz, ref.npz 
├── word2
|	├── ...
├── ...
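To verify that preprocessing worked, you can open one sample folder, count the extracted face crops and list the arrays stored in the two .npz files. The snippet below is only an illustrative sketch (the sample folder name is hypothetical, and the array keys inside the .npz files are simply printed rather than assumed):

# inspect a single preprocessed sample (folder name is hypothetical; adjust to your data)
import glob
import numpy as np

sample_dir = "lrw_preprocessed/word1/test/word1_00001"

frames = sorted(glob.glob(sample_dir + "/*.jpg"))
print(len(frames), "face crops")

for name in ("mels.npz", "ref.npz"):
    with np.load(sample_dir + "/" + name) as data:
        for key in data.files:
            print(name, key, data[key].shape)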

Generating for the given test split

python complete_test_generate.py -d lrw_preprocessed/ -r lrw_test_results/ --checkpoint <path_to_checkpoint>

# A sample checkpoint path can be found in hparams.py under the "eval_ckpt" param.

This will create:

lrw_test_results/
├── gts/  (ground-truth audio files)
|	├── *.wav
├── wavs/ (generated audio files)
|	├── *.wav
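Generated files in wavs/ are assumed here to share their names with the matching ground-truth files in gts/. A small sketch (not part of the repository) for listing the pairs and comparing durations before scoring:

# pair generated clips with their ground-truth counterparts and compare durations
# (assumes matching filenames in gts/ and wavs/)
import glob, os, wave

def duration(path):
    with wave.open(path) as w:
        return w.getnframes() / w.getframerate()

for gen in sorted(glob.glob("lrw_test_results/wavs/*.wav")):
    gt = os.path.join("lrw_test_results/gts", os.path.basename(gen))
    if os.path.isfile(gt):
        print(os.path.basename(gen), round(duration(gen), 2), round(duration(gt), 2))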

Calculating the metrics

You can calculate the PESQ, ESTOI and STOI scores for the above generated results using score.py:

python score.py -r lrw_test_results/
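score.py computes these metrics over the whole results folder. For a single pair of files, a rough per-file equivalent looks like the sketch below, assuming the librosa, pesq and pystoi packages and 16 kHz audio; the filename is only an example, and score.py remains the reference implementation:

# rough per-file version of the metrics reported by score.py
# (assumes librosa, pesq and pystoi are installed; audio resampled to 16 kHz)
import librosa
from pesq import pesq
from pystoi import stoi

sr = 16000
ref, _ = librosa.load("lrw_test_results/gts/word1_00001.wav", sr=sr)   # ground truth
gen, _ = librosa.load("lrw_test_results/wavs/word1_00001.wav", sr=sr)  # generated

n = min(len(ref), len(gen))  # compare equal-length signals
ref, gen = ref[:n], gen[:n]

print("PESQ :", pesq(sr, ref, gen, "wb"))
print("STOI :", stoi(ref, gen, sr, extended=False))
print("ESTOI:", stoi(ref, gen, sr, extended=True))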

Training

python train.py <name_of_run> --data_root Dataset/chem/

Additional arguments can also be set directly or passed through --hparams; for details, run python train.py -h.

License and Citation

The software is licensed under the MIT License. Please cite the following paper if you use this code:

@InProceedings{Prajwal_2020_CVPR,
author = {Prajwal, K R and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C.V.},
title = {Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis},
booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2020}
}

Acknowledgements

This repository is modified from this TTS repository. We thank the author for this wonderful code. The code for face detection has been taken from the face_alignment repository. We thank the authors for releasing their code and models.
