Lip2Wav

Generate high-quality speech from only lip movements. This code accompanies the paper Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis, published at CVPR 2020.

[Paper] | [Project Page] | [Demo Video]


Highlights

  • First work to generate intelligible speech from only lip movements in unconstrained settings.
  • First multi-speaker lip-to-speech generation results.
  • Complete training code and pretrained models made available.
  • Inference code to generate results from the pre-trained models.
  • Code to calculate metrics reported in the paper is also made available.

You might also be interested in:

🎉 Lip-sync talking face videos to any speech using Wav2Lip: https://github.com/Rudrabha/Wav2Lip

Prerequisites

  • Python 3.7.4 (code has been tested with this version)
  • ffmpeg: sudo apt-get install ffmpeg
  • Install necessary packages using pip install -r requirements.txt
  • Face detection pre-trained model should be downloaded to face_detection/detection/sfd/s3fd.pth
  • Speaker-embedding pre-trained model should be downloaded from this link (navigate to encoder/saved_models/pretrained.pt there) and saved locally to encoder/saved_models/pretrained.pt; a quick check for both model files is sketched below.
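Before moving on, it can help to confirm that both model files ended up where the scripts expect them. The check below is not part of the repository, just a small sketch using the paths listed above:

# quick sanity check that the pre-trained models are at the expected paths
# (helper snippet, not part of the repository; paths taken from the prerequisites above)
import os

for path in ("face_detection/detection/sfd/s3fd.pth",
             "encoder/saved_models/pretrained.pt"):
    print(path, "OK" if os.path.isfile(path) else "MISSING")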

Getting the weights

Download the weights of our model trained on the LRW dataset.

Preprocessing the LRW dataset

The LRW dataset is organized as follows.

data_root (lrw/ in the below examples)
├── word1
|	├── train, val, test (3 splits)
|	|    ├── *.mp4, *.txt
├── word2
|	├── ...
├── ...
Run the preprocessing script on the split you want to use:

python preprocess.py --data_root lrw/ --preprocessed_root lrw_preprocessed/ --split test

# dump speaker embeddings in the same preprocessed folder
python preprocess_speakers.py --preprocessed_root lrw_preprocessed/

Additional options such as the batch size, the number of GPUs and the split to use can also be set. After preprocessing, you should get the following structure (a quick way to inspect a single preprocessed sample is sketched after the tree):

data_root (lrw_preprocessed/ in the above example)
├── word1
|	├── train, val, test (preprocessed splits)
|	|    ├── word1_00001, word1_00002...
|	|    |    ├── *.jpg, mels.npz, ref.npz 
├── word2
|	├── ...
├── ...
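To verify that preprocessing worked, you can open one sample folder, count the extracted face crops and list the arrays stored in the two .npz files. The snippet below is only an illustrative sketch (the sample folder name is hypothetical, and the array keys inside the .npz files are simply printed rather than assumed):

# inspect a single preprocessed sample (folder name is hypothetical; adjust to your data)
import glob
import numpy as np

sample_dir = "lrw_preprocessed/word1/test/word1_00001"

frames = sorted(glob.glob(sample_dir + "/*.jpg"))
print(len(frames), "face crops")

for name in ("mels.npz", "ref.npz"):
    with np.load(sample_dir + "/" + name) as data:
        for key in data.files:
            print(name, key, data[key].shape)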

Generating for the given test split

python complete_test_generate.py -d lrw_preprocessed/ -r lrw_test_results/ --checkpoint <path_to_checkpoint>

# A sample checkpoint path can be found in hparams.py under the "eval_ckpt" param.

This will create:

lrw_test_results/
├── gts/  (ground-truth audio files)
|	├── *.wav
├── wavs/ (generated audio files)
|	├── *.wav
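Generated files in wavs/ are assumed here to share their names with the matching ground-truth files in gts/. A small sketch (not part of the repository) for listing the pairs and comparing durations before scoring:

# pair generated clips with their ground-truth counterparts and compare durations
# (assumes matching filenames in gts/ and wavs/)
import glob, os, wave

def duration(path):
    with wave.open(path) as w:
        return w.getnframes() / w.getframerate()

for gen in sorted(glob.glob("lrw_test_results/wavs/*.wav")):
    gt = os.path.join("lrw_test_results/gts", os.path.basename(gen))
    if os.path.isfile(gt):
        print(os.path.basename(gen), round(duration(gen), 2), round(duration(gt), 2))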

Calculating the metrics

You can calculate the PESQ, ESTOI and STOI scores for the above generated results using score.py:

python score.py -r lrw_test_results/
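score.py computes these metrics over the whole results folder. For a single pair of files, a rough per-file equivalent looks like the sketch below, assuming the librosa, pesq and pystoi packages and 16 kHz audio; the filename is only an example, and score.py remains the reference implementation:

# rough per-file version of the metrics reported by score.py
# (assumes librosa, pesq and pystoi are installed; audio resampled to 16 kHz)
import librosa
from pesq import pesq
from pystoi import stoi

sr = 16000
ref, _ = librosa.load("lrw_test_results/gts/word1_00001.wav", sr=sr)   # ground truth
gen, _ = librosa.load("lrw_test_results/wavs/word1_00001.wav", sr=sr)  # generated

n = min(len(ref), len(gen))  # compare equal-length signals
ref, gen = ref[:n], gen[:n]

print("PESQ :", pesq(sr, ref, gen, "wb"))
print("STOI :", stoi(ref, gen, sr, extended=False))
print("ESTOI:", stoi(ref, gen, sr, extended=True))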

Training

python train.py <name_of_run> --data_root Dataset/chem/

Additional arguments can also be set directly or passed through --hparams; for details, run python train.py -h.

License and Citation

The software is licensed under the MIT License. Please cite the following paper if you use this code:

@InProceedings{Prajwal_2020_CVPR,
author = {Prajwal, K R and Mukhopadhyay, Rudrabha and Namboodiri, Vinay P. and Jawahar, C.V.},
title = {Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis},
booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2020}
}

Acknowledgements

This repository is modified from this TTS repository. We thank the author for this wonderful code. The code for face detection has been taken from the face_alignment repository. We thank the authors for releasing their code and models.
